Introduction

A brief history of Enron

Enron is a natural-gas-transmission company founded in 1985 in the US. In the 1990s, the US Congress adopted a series of laws to deregulate the sale of natural gas. This caused Enron to lose its exclusivity rights on the natural gas pipeline. During this time, Jeffrey Skilling, who was initially a consultant and later became the company’s chief operating officer, transformed Enron into a trader of energy derivatives, acting as an intermediary between natural-gas producers and their customers. Soon after, Enron became a leader in this market and made huge profits from its trades. This golden age for the company allowed them to recruit Andrew Fastow, who quickly became the chief financial officer. Moreover, they diversified their activities to include electricity, coal, paper, and steel. However, success has its limits, and in the late 1990s, the company’s profits began to shrink. Under pressure from shareholders, company executives started relying on dubious accounting practices, particularly using “mark-to-market accounting,” which allowed the company to record unrealized future gains from some trading contracts as current income, thus giving the illusion of higher current profits. In August 2001, some people at the head of the company began to worry about a possible accounting scandal due to this practice. In October 2001, the Securities and Exchange Commission began investigating Enron’s transactions. This was the starting event that led the company to bankruptcy, which officially began in December 2001.

Source: Britannica, “Enron scandal.”

Project aims

The principal aim of this project is to explore the Enron email data set to extract insights about the fraud investigation and the bankruptcy of the company in 2001. For that we have four data sets:

  • the employee list with their email addresses

  • the emails exchanged from 1999 to 2002

  • the recipients of each email (to, cc, bcc)

  • the content of a subset of the emails (the reference information).

In this study we investigate the email exchanges from the side of both the sender and the recipient. This is done at three levels:

  • without any a priori, meaning all senders and recipients

  • according to the employee status

  • for some people known to be implicated in the fraud, as well as the people found to be the most active in the email exchanges.

At each level we look at the number of emails sent and received over the study period, and we analyze the subject and text of those emails by focusing on keywords attached to a few topics (meetings, business, and Enron events).

The resulting insights are available in a Shiny app.

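For this project we used several libraries, listed here:

For data exploration, analysis, and visualization: tidyverse, circlize, wordcloud, ggpubr, patchwork, gridExtra, grid, gtable, and ggbreak.

To display the results in the R Markdown report: knitr.

To create the Shiny app: shiny.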

#library
library(tidyverse)
library(circlize)
library(wordcloud)
library(ggpubr)
library(patchwork)
library(gridExtra)
library(grid)
library(gtable)
library(ggbreak)
library(knitr)
library(shiny)

#dataset
load(file = "C:/Users/marie/Documents/DSTI_Cours/R_big_Data/Exam/Enron_project/Enron.Rdata")

We designed a function that extracts a legend shared by several plots in the same layout, so that it is displayed only once. We do not use it when the legend changes between plots, to avoid confusion.

#function to extract the legend shared by several plots
#note: this definition masks ggpubr::get_legend(), since ggpubr is loaded above
get_legend <- function(p, #a plot whose legend is shared by the other plots of the layout
                       nrow = 2 #number of rows used to display the legend, 2 by default
                       ){
  
  #override the guides to control the number of rows in the legend
  p_wrapped <- p + guides(
    fill = guide_legend(nrow = nrow, byrow = TRUE),
    color = guide_legend(nrow = nrow, byrow = TRUE))
  
  #build the table of graphical objects (grobs) that make up the plot
  temp <- ggplotGrob(p_wrapped)
  
  #keep the grobs named "guide-box", which hold the legend
  legend <- temp$grobs[which(sapply(temp$grobs, function(x) x$name) == "guide-box")]
  
  #return only the first legend, not the list
  return(legend[[1]])
} 

Data exploring and cleaning

First look at the data

The aim of this part is to see:

  • which kind of data the different tables contain

  • whether there are missing values and how to handle them

employee dataset

Description of the data set variables and dimension:

dim_employee <- dim(employeelist)

summary(employeelist)
##       eid          firstName           lastName           Email_id        
##  Min.   :  1.00   Length:149         Length:149         Length:149        
##  1st Qu.: 38.00   Class :character   Class :character   Class :character  
##  Median : 75.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 75.07                                                           
##  3rd Qu.:112.00                                                           
##  Max.   :150.00                                                           
##                                                                           
##     Email2             Email3             EMail4             folder         
##  Length:149         Length:149         Length:149         Length:149        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##             status  
##  Employee      :41  
##  N/A           :31  
##  Vice President:23  
##  Director      :14  
##  Manager       :14  
##  (Other)       :25  
##  NA's          : 1

This data set contains 149 rows and 9 columns.

It contains the employee ID (eid), the first and last name of each employee, their status, their email addresses, and the folder where their emails are stored. The status variable has missing values of two kinds: those recognized by R (NA) and those entered directly in the data by its owner, written N/A. The eid variable is numeric, status is a factor, and the other variables are character.

Display of some observations in the data frame:

kable(employeelist[1:10, ])
eid firstName lastName Email_id Email2 Email3 EMail4 folder status
13 Marie Heard heard-m NA
6 Mark Taylor taylor-m Employee
19 Lindy Donoho donoho-l Employee
115 Lisa Gang gang-l N/A
129 Jeffrey Skilling skilling-j CEO
18 Lynn Blair blair-l Director
33 Kim Ward ward-k N/A
149 Kate Symes symes-k Employee
52 Kay Mann mann-k Employee
21 Keith Holst holst-k Director

Looking at the head of the data, we see that eid is stored as numeric, but a factor seems better suited because it is an employee identifier. In addition, the variables Email2, Email3, and EMail4 contain many blanks.

To investigate the blanks, we temporarily change the data type of those variables from character to factor to see how summary() reports the blank observations.

kable(employeelist %>% transform(
  Email2 = as.factor(Email2),
  Email3 = as.factor(Email3),
  EMail4 = as.factor(EMail4)
) %>% summary())
eid firstName lastName Email_id Email2 Email3 EMail4 folder status
Min. : 1.00 Length:149 Length:149 Length:149 :52 :100 :147 Length:149 Employee :41
1st Qu.: 38.00 Class :character Class :character Class :character a..shankman@enron.com : 1 a..martin@enron.com : 1 j..kean@enron.com : 1 Class :character N/A :31
Median : 75.00 Mode :character Mode :character Mode :character : 1 : 1 : 1 Mode :character Vice President:23
Mean : 75.07 NA NA NA : 1 : 1 NA NA Director :14
3rd Qu.:112.00 NA NA NA b..sanders@enron.com : 1 : 1 NA NA Manager :14
Max. :150.00 NA NA NA : 1 : 1 NA NA (Other) :25
NA NA NA NA (Other) :92 (Other) : 44 NA NA NA’s : 1

We can see that the Email2, Email3, and EMail4 variables have no missing values in the R sense, but they contain blank strings: 52 in Email2, 100 in Email3, and 147 in EMail4. In Email3 and EMail4 more than half of the values are blank, so those variables may not be very helpful for the analysis. In the status variable the missing values are declared in two ways: 31 values are written N/A and only 1 is a real NA. For that variable we need to replace N/A with real NA values to homogenize the data.
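The conversion of blank strings and hand-typed N/A markers into real NA values can be sketched on toy data. This is a minimal illustration, assuming character columns; the table toy_employees and its values are hypothetical, and in the real employeelist the status column is a factor, so it may need to be converted with as.character() first.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the Enron tables)
toy_employees <- tibble(
  Email3 = c("", "a.b@example.com", ""),
  status = c("N/A", "Director", NA)
)

toy_clean <- toy_employees %>%
  mutate(
    Email3 = na_if(Email3, ""),    # blank string becomes NA
    status = na_if(status, "N/A")  # hand-typed marker becomes NA
  )

# Number of missing values per column after the conversion
missing_per_column <- colSums(is.na(toy_clean))
```

Once both kinds of missing value are real NA, is.na() counts them in the same way for every column.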

message data set

Description of the data set variables and dimension:

dim_message <- dim(message)

kable(summary(message))
mid sender date message_id subject
Min. : 52 : 6273 Min. :0001-05-30 : 1 Length:252759
1st Qu.: 88565 : 5838 1st Qu.:2000-12-01 : 1 Class :character
Median :186421 : 5100 Median :2001-05-21 : 1 Mode :character
Mean :190260 : 4797 Mean :1999-04-15 : 1 NA
3rd Qu.:279962 : 4437 3rd Qu.:2001-10-25 : 1 NA
Max. :404927 : 3686 Max. :2044-01-04 : 1 NA
NA (Other) :222628 NA (Other) :252753 NA

This data set contains 252759 rows and 5 columns.

Here we observe that mid and date are summarized like numeric variables, sender and message_id are factors, and subject is character.

Display of some observations in the data frame:

kable(message[1:10, ])
mid sender date message_id subject
52 2000-01-21 ENRON HOSTS ANNUAL ANALYST CONFERENCE PROVIDES BUSINESS OVERVIEW AND GOALS FOR 2000
53 2000-01-24 Over $50 – You made it happen!
54 2000-01-24 Over $50 – You made it happen!
55 2000-02-02 ROAD-SHOW.COM Q4i.COM CHOOSE ENRON TO DELIVER FINANCIAL WEB CONTENT
56 2000-02-07 Fortune Most Admired Ranking
57 2000-08-25 WPTF Friday Credo Veritas Burrito
58 2000-06-21 SAP ID - Here it is!!!!!
59 2000-06-27 Set of Graphs
60 2000-07-25 Block Forward Financial Trades
61 2000-07-27 Block forwards

Looking at the head of the data, we see that mid does not behave like a numeric variable but like an identifier, as eid does in the employeelist table. The date variable is displayed as a date. Moreover, some values of the subject variable are repeated several times, suggesting it could be treated as a categorical variable rather than as individual strings.

Because summary() reports date like a numeric variable while the observations displayed above look like real dates, we check with the class() function whether R stores it as a Date:

class(message$date) == "Date"
## [1] TRUE

The result confirms that R stores the date variable with the right type, Date, so there is no need to change it.

However, the minimum and maximum dates returned are strange. As stated in the introduction, the data cover the period from 1999 to 2002, and those values fall outside that period.

To understand these values, we filter the table to keep the rows where the year is before 1999 or after 2002:

kable(message %>% 
  select(date) %>% #keep the date variable
  mutate(year = format(date,"%Y")) %>% #extract the year from the date
  filter((year < 1999) | (year > 2002)) %>% #keep the value below and after the study's period
  group_by(year) %>% count()) #count the number of rows per date out of the study's period
year n
0001 205
0002 53
1979 6
1997 1
1998 85
2004 53
2007 1
2020 2
2043 1
2044 3

Filtering the strange dates shows that some are not plausible dates (years 0001 and 0002) and the others are outside the study period. This represents 410 values in total, which is less than 1% of the observations in the table (about 0.16%).
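The share of out-of-period dates can also be computed directly. The sketch below uses a hypothetical toy table and converts the year to an integer, an assumption made here so that the comparison is numeric instead of relying on the ordering of character strings.

```r
library(dplyr)

# Toy dates (hypothetical): two inside the study period, two outside
toy_messages <- tibble(
  date = as.Date(c("1979-12-31", "2000-01-21", "2001-10-25", "2044-01-04"))
)

toy_flagged <- toy_messages %>%
  mutate(
    year = as.integer(format(date, "%Y")),   # numeric year
    in_period = between(year, 1999L, 2002L)  # TRUE from 1999 to 2002 inclusive
  )

# Number and share of rows outside the study period
out_of_period <- sum(!toy_flagged$in_period)
share_out <- out_of_period / nrow(toy_flagged)
```

The in_period flag can be used with filter() to drop the rows, or kept for inspection before deciding.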

The variables mid and message_id could be redundant. To verify this, we count the rows per message_id and mid to see whether a message_id could be attached to several mid values.

kable(message%>% select(mid, message_id) %>% #select only the variable we need.
  transform(mid = as.factor(mid)) %>% #transform the mid into factor data type.
  group_by(message_id) %>% 
  count(mid) %>% #count the number of mid per message_id group, create a n variable with the result.
  filter(n != 1)) #filter to get the rows with a value different than 1.
message_id mid n

The result is empty: no message_id and mid pair appears more than once, which suggests that each message_id is attached to a single mid and that the two variables are redundant. To lighten the data, we can keep only one of them in the data frame for the analysis.
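A stricter way to test the one-to-one link is to count the distinct mid values per message_id. The sketch below is an illustration on hypothetical toy data, not the check run in this report: it shows a case where counting the pairs finds nothing while counting distinct values flags the problem.

```r
library(dplyr)

# Toy identifiers (hypothetical): "c" is deliberately attached to two mids
toy_ids <- tibble(
  mid        = c(1, 2, 3, 4),
  message_id = c("a", "b", "c", "c")
)

# Distinct mid values per message_id; a value above 1 breaks the one-to-one link
not_unique <- toy_ids %>%
  group_by(message_id) %>%
  summarise(n_mid = n_distinct(mid), .groups = "drop") %>%
  filter(n_mid != 1)

# Counting the (message_id, mid) pairs does not flag "c": each pair occurs once
pair_counts <- toy_ids %>%
  count(message_id, mid) %>%
  filter(n != 1)
```

An empty not_unique table on the real data would confirm the redundancy more firmly than an empty pair count.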

As we saw in the table header, the sender variable holds the email address of each sender. Those addresses are also in the employeelist table, which gives the status in the company for most employees, but there they are split across four variables. In addition, Email3 and EMail4 contain many blank values. To see how we can merge the two tables, we look at how the email addresses correspond between them.

#prepared table to only check which email address in the Email_ID are also in the sender
employee_merge1 <- employeelist %>% mutate(sender = Email_id) %>% select(sender)
employee_merge2 <- employeelist %>% mutate(sender = Email2) %>% select(sender)
employee_merge3 <- employeelist %>% mutate(sender = Email3) %>% select(sender)
employee_merge4 <- employeelist %>% mutate(sender = EMail4) %>% select(sender)

#to do the join only with the sender variable
message_merge <- message %>% select(sender)
#first between the sender in the message table and the Email_id in the employeelist
EmailID_sender1 <- inner_join(message_merge, employee_merge1, by = "sender")

EmailID_sender1 %>% count()
##        n
## 1 104766
#between the sender in the message table and the Email2 in the employeelist
EmailID_sender2 <- inner_join(message_merge, employee_merge2, by = "sender")

EmailID_sender2 %>% count()
##   n
## 1 0
#between the sender in the message table and the Email3 in the employeelist
EmailID_sender3 <- inner_join(message_merge, employee_merge3, by = "sender")

EmailID_sender3 %>% count()
##      n
## 1 1170
#between the sender in the message table and the EMail4 in the employeelist
EmailID_sender4 <- inner_join(message_merge, employee_merge4, by = "sender")

EmailID_sender4 %>% count()
##   n
## 1 0

Using inner_join, we see that only the Email_id and Email3 variables of the employeelist table contain email addresses that also appear in the sender variable of the message table. To attach the employee status to the sender email address, we need to merge on those two variables.
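The four separate joins can be condensed by first reshaping the address columns into a long format. The sketch below is one possible alternative on hypothetical toy tables, assuming tidyr is available (it is part of the tidyverse); the column name source is introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Toy tables (hypothetical addresses)
toy_employees <- tibble(
  eid      = c(1, 2),
  Email_id = c("ann.lee@example.com", "bob.ray@example.com"),
  Email2   = c("", "b.ray@example.com"),
  Email3   = c("a.lee@example.com", "")
)
toy_messages <- tibble(
  sender = c("ann.lee@example.com", "a.lee@example.com",
             "ann.lee@example.com", "outsider@example.com")
)

# One row per (employee, address) pair, with the blank addresses removed
toy_addresses <- toy_employees %>%
  pivot_longer(starts_with("Email"), names_to = "source", values_to = "sender") %>%
  filter(sender != "")

# Number of messages matched through each address column
matches_per_column <- toy_messages %>%
  inner_join(toy_addresses, by = "sender") %>%
  count(source)
```

Address columns that never match, such as Email2 in this toy example, simply do not appear in the result. Since starts_with() ignores case by default, a column spelled EMail4 would be picked up as well.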

recipient info data set

Description of the data set variables and dimension:

dim_recipient <- dim(recipientinfo)

summary(recipientinfo)
##       rid               mid         rtype        
##  Min.   :     67   Min.   :    52   BCC: 253713  
##  1st Qu.: 718289   1st Qu.:105438   CC : 253735  
##  Median :1515296   Median :198263   TO :1556994  
##  Mean   :1543862   Mean   :196168                
##  3rd Qu.:2309682   3rd Qu.:280673                
##  Max.   :3242063   Max.   :404927                
##                                                  
##                        rvalue       
##  no.address@enron.com     :  19198  
##  jeff.dasovich@enron.com  :  11137  
##  richard.shapiro@enron.com:  11015  
##  steven.j.kean@enron.com  :  10873  
##  james.d.steffes@enron.com:  10615  
##  tana.jones@enron.com     :   9781  
##  (Other)                  :1991823

This data set contains 2064442 rows and 4 columns. The summary reveals that rid and mid are treated as numeric by R, and that rtype and rvalue are factors.

Display of some observations in the data frame:

rid mid rtype rvalue
67 52 TO
68 53 TO
69 54 TO
70 55 TO
71 56 TO
72 56 TO
73 57 TO
74 58 TO
75 59 TO
76 60 TO

Looking at the head of this data set, we see that rid and mid are identifiers; given the summary output, we need to convert them to factors. The mid variable is also a foreign key that links this table to the message table. Joining the two tables gives us the sender and the recipient of each email, as well as the type of recipient (direct with TO, or “indirect” with CC and BCC). The last variable, rvalue, is the email address of the recipient, which can be general (e.g., no.address@enron.com, the most frequent value in the summary) or specific to a person (e.g., jeff.dasovich@enron.com, the top named recipient in the summary of that table). The personal addresses in rvalue can be looked up among the email addresses of the employeelist table to get the status of each recipient in the company. We proceed as with the message table.

#prepare one table per address column to check which email addresses also appear in rvalue
employee_merge1 <- employeelist %>% mutate(rvalue = Email_id) %>% select(rvalue)
employee_merge2 <- employeelist %>% mutate(rvalue = Email2) %>% select(rvalue)
employee_merge3 <- employeelist %>% mutate(rvalue = Email3) %>% select(rvalue)
employee_merge4 <- employeelist %>% mutate(rvalue = EMail4) %>% select(rvalue)

#to do the join only with the rvalue variable
recipient_merge <- recipientinfo %>% select(rvalue)
#first between the rvalue in the recipient table and the Email_id in the employeelist
EmailID_recipient1 <- inner_join(recipient_merge, employee_merge1, by = "rvalue")

EmailID_recipient1 %>% count()
##        n
## 1 361234
# between the rvalue in the recipient table and the Email2 in the employeelist
EmailID_recipient2 <- inner_join(recipient_merge, employee_merge2, by = "rvalue")

EmailID_recipient2 %>% count()
##   n
## 1 0
#between the rvalue in the recipient table and the Email3 in the employeelist
EmailID_recipient3 <- inner_join(recipient_merge, employee_merge3, by = "rvalue")

EmailID_recipient3 %>% count()
##      n
## 1 2382
#between the rvalue in the recipient table and the EMail4 in the employeelist
EmailID_recipient4 <- inner_join(recipient_merge, employee_merge4, by = "rvalue")

EmailID_recipient4 %>% count()
##   n
## 1 0

As with the message table, rvalue only matches the Email_id and Email3 variables.

reference info data set

Description of the data set variables and dimension:

dim_reference <- dim(referenceinfo)

summary(referenceinfo)
##       rfid            mid          reference        
##  Min.   :    2   Min.   :    79   Length:54778      
##  1st Qu.:14305   1st Qu.: 60580   Class :character  
##  Median :30987   Median :178176   Mode  :character  
##  Mean   :30860   Mean   :179738                     
##  3rd Qu.:46728   3rd Qu.:275557                     
##  Max.   :63024   Max.   :404920

This data set contains 54778 rows and 3 columns.

The summary shows that rfid and mid are numeric and that reference is character.

Display of some observations in the data frame:

kable(referenceinfo[5:10, ])
rfid mid reference
5 14 845 From: Monaco, John [EM] [mailto:john.monaco@citi.com]Sent: Thursday, March 07, 2002 6:40 AMTo: Badeer, RobertSubject: FW: RE: Whats up!!!!!Still around!!!!—–Original Message—–From: [mailto:enron.mailsweeper.admin@enron.com] Sent: Thursday, March 07, 2002 9:36 AMTo: Monaco, John [EM]Subject: RE:RE: Whats up!!!!!The enron.com recipient(s) moved to a new organization. The new email address follows the (as per their original enron.comemail address). Email sent to recipient(s) at enron.com will not bedelivered.
6 15 846 From: Rangel, Ina Sent: Thursday, March 07, 2002 8:11 AMTo: Badeer, RobertSubject: Expense ReceiptsBob:I received your expense receipts today. Will submit them today.Ina Rangel
7 16 847 From: Grigsby, Mike Sent: Friday, March 08, 2002 9:08 AMTo: Badeer, RobertSubject: RE: BADGEGo with Ina —–Original Message—–From: Badeer, Robert Sent: Friday, March 08, 2002 11:08 AMTo: Grigsby, MikeSubject: RE: BADGEGrigs, Ina said it would be on the 5th floor of the new building. Which is right? —–Original Message—–From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256
8 17 848 From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256
9 18 849 From: Rangel, Ina Sent: Thursday, March 07, 2002 12:56 PMTo: Badeer, RobertSubject: FW: Badge AccessWhen you get here on Monday morning, come to the 5th floor reception of the new building. If your badge is not there, then I will come and pick you up when you get here and bring you up. Your badge will be ready Monday for sure, whether it be morning or afternoon I am not sure of.-Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:50 PMTo: Rangel, InaSubject: RE: Badge AccessIna,We can most likely have this by Monday morning and he can pick this up at the 5th floor reception. If he has any problems he can call me. Thanks!Mandy —–Original Message—–From: Rangel, Ina Sent: Thursday, March 07, 2002 2:39 PMTo: Curless, AmandaSubject: RE: Badge Access << File: Badge Access Form.doc >> I filled out all of the information that I had on him. Will he be able to have his badge by Monday morning and where will he go to pick it up.Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:00 PMTo: Rangel, InaSubject: Badge Access << File: Badge Access Form.doc >> Ina,Pleae fill out and return to me at ECS 05848. You can e-mail this to me if this is easier. Thanks!Mandy
10 19 851 From: Hyatt, Kevin Sent: Wednesday, July 25, 2001 1:00 PMTo: Nielsen, JeffSubject: RE: Mid 4 to Mid 3 QuoteJeff, can you fill in the rates for the 5,7, and 10 year terms for me. These would be notional of course. Let me know if you have questions.thxKevin 713-853-5559 Term/yrs. 2 5 7 10 Demand: Firm* $.02 - .03 $.04-.05 $.06-.07 $.07-.08 TI $.035 - .045 \(.065-\).075 $.075-.085 $.095-.105 Volume is min. 0 to max of 200,000/d * plus minimum commodity Primary to El Paso Waha would be slightly higher Rates are plus fuel —–Original Message—–From: Nielsen, Jeff Sent: Monday, July 23, 2001 4:39 PMTo: Hyatt, KevinSubject: Mid 4 to Mid 3 QuoteKevin,Jo Williams said that you needed a quote for transportation from Mid 4 to Mid 3 in the Waha area. On a firm basis we would be would in the $.02 to $.03 demand range plus minimum commodity. For a TI rate use between $.035 and $.045. If you would like primary to El Paso Waha, that rate would be a little higher. We have been able to get additional value out of that interconnect because of the gas prices in California. Please let me know if you need any additional information.Jeff 402-398-7434

Looking at these observations, we can see that:

  • rfid and mid are not really numeric variables but identifiers. Their data type should be changed to factor.

  • the reference variable holds the content of each message. The table also has the mid variable, which allows us to merge it with the message and/or recipientinfo tables.

  • the message and recipientinfo tables contain email addresses, as the employeelist table does, so these tables can be merged through them.

Exploring these data sets revealed several issues to handle before the analysis: data type changes, missing values, variable redundancy, and data set merging.

We chose to:

  • Change the data type of the identifier variables in the different tables from numeric to factor.

  • Change the data type of the subject variable from character to factor.

  • Remove the message_id variable from the message table to lighten the data set. In addition, we drop the rows whose date is outside the study period (1999 to 2002), including the implausible dates.

  • Remove the Email2 and EMail4 variables from the employeelist table because they do not match any email address in the message and recipientinfo tables.

  • Keep the referenceinfo table even though it is not exhaustive: it contains only 54,778 observations, which is less than 3% of the rows of the recipientinfo table. We will therefore be able to analyze the content of only a small part of the email exchanges.

  • Create a table that gathers all the information about the messages by merging the message, referenceinfo, and recipientinfo tables through the mid foreign key.

  • Keep the NA values in the status of the sender and the recipient. This lets us keep all the information about the exchanges; dropping them could lose information.

Data engineering and cleaning

Employeelist table

employeelist_2 <- employeelist %>% 
  select(-c(Email2, EMail4)) %>% #the variable we don't need in the data
  transform(eid = as.factor(eid)) %>% #data type change for the variable eid to factor
  mutate(status = if_else((status == "N/A"), NA, status)) #homogenized the declaration of the NA in the variable status

Description of the new table employee list:

summary(employeelist_2)
##       eid       firstName           lastName           Email_id        
##  1      :  1   Length:149         Length:149         Length:149        
##  2      :  1   Class :character   Class :character   Class :character  
##  3      :  1   Mode  :character   Mode  :character   Mode  :character  
##  4      :  1                                                           
##  5      :  1                                                           
##  6      :  1                                                           
##  (Other):143                                                           
##     Email3             folder                     status  
##  Length:149         Length:149         Employee      :41  
##  Class :character   Class :character   Vice President:23  
##  Mode  :character   Mode  :character   Director      :14  
##                                        Manager       :14  
##                                        Trader        :13  
##                                        (Other)       :12  
##                                        NA's          :32

Verification of the data type of the table variables:

#return the data type for every variable in the table
str(employeelist_2)
## 'data.frame':    149 obs. of  7 variables:
##  $ eid      : Factor w/ 149 levels "1","2","3","4",..: 13 6 19 115 129 18 33 148 52 21 ...
##  $ firstName: chr  "Marie" "Mark" "Lindy" "Lisa" ...
##  $ lastName : chr  "Heard" "Taylor" "Donoho" "Gang" ...
##  $ Email_id : chr  "marie.heard@enron.com" "mark.e.taylor@enron.com" "lindy.donoho@enron.com" "lisa.gang@enron.com" ...
##  $ Email3   : chr  "" "e.taylor@enron.com" "" "" ...
##  $ folder   : chr  "heard-m" "taylor-m" "donoho-l" "gang-l" ...
##  $ status   : Factor w/ 10 levels "CEO","Director",..: NA 3 3 NA 1 2 NA 3 3 2 ...

The output of summary() and str() shows that the data type change, the homogenization of the NA values, and the removal of the variables were done correctly. We can now use this table for the rest of the analysis.

message table

message_2 <- message %>%
  select(-c(message_id)) %>% #withdraw the variable we don't need
  transform(#change the data type for factor
    mid = as.factor(mid),
    sender = as.factor(sender),
    subject = as.factor(subject)) %>%
  #add the year variable in the table from the date
  mutate(year = as.factor(format(date, "%Y"))) %>% 
  #filter to keep only the dates from 1999 to 2002
  filter(year %in% c(1999 : 2002)) %>%
  #drop the year variable which is no longer useful in the data
  select(-year)

recipientinfo

recipientinfo_2 <- recipientinfo %>%
  #change the variable data type for factor
  transform(rid = as.factor(rid),
            rvalue = as.factor(rvalue),
    mid = as.factor(mid))

referenceinfo

referenceinfo_2 <- referenceinfo %>%
  #change the variable data type for factor
  transform(rfid = as.factor(rfid),
    mid = as.factor(mid))
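The following sections use a data frame named df_message, but the code that builds it is not shown in this report. As a rough sketch only, it could be obtained by joining the three cleaned tables on mid. The toy tables below are hypothetical, mid is kept as an integer for simplicity, and the use of left joins is an assumption; it is consistent with the later filter on !is.na(reference), which implies that some messages have no content attached.

```r
library(dplyr)

# Toy versions of the three cleaned tables (hypothetical values)
toy_message <- tibble(
  mid    = c(52L, 53L),
  sender = c("ann.lee@example.com", "bob.ray@example.com"),
  date   = as.Date(c("2000-01-21", "2000-01-24"))
)
toy_recipient <- tibble(
  rid    = c(67L, 68L, 69L),
  mid    = c(52L, 53L, 53L),
  rtype  = c("TO", "TO", "CC"),
  rvalue = c("x@example.com", "y@example.com", "z@example.com")
)
toy_reference <- tibble(
  rfid      = 1L,
  mid       = 53L,
  reference = "From: ..."
)

# Assumed merge: one row per (message, recipient), with the content when available
toy_df_message <- toy_message %>%
  left_join(toy_recipient, by = "mid") %>%
  left_join(toy_reference, by = "mid")
```

With this layout a message sent to several recipients appears on several rows, so row counts in the merged table count recipient rows, not distinct emails.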

Merging the employee status with the df_message table

First, we do it for the sender with Email_id

#prepared the employeelist table for the merge
employee_merge_final <- employeelist_2 %>% 
  select(Email_id, status) %>% #keep only the variables we need
  mutate(status_sender = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merged with the df_message table 
df_message_status <- left_join(df_message, employee_merge_final, 
                               join_by(sender == Email_id))

#verification the merged work
df_message_status %>% filter(!is.na(status_sender)) %>% count()
##        n
## 1 294291

Then we do it for the sender with Email3

#prepared the employeelist table for the merge
employee_merge_final2 <- employeelist_2 %>% 
  select(Email3, status) %>% #keep only the variables we need
  mutate(status_sender_email3 = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merged with the df_message table 
df_message_status <- left_join(df_message_status, employee_merge_final2, 
                               join_by(sender == Email3))

#verification the merged work
df_message_status %>% filter(!is.na(status_sender_email3)) %>% count()
##      n
## 1 2034

Group all the sender statuses into one variable

df_message_status <- df_message_status %>% 
  #fill the NA values of status_sender with the status found through Email3
  mutate(status_sender = coalesce(status_sender, status_sender_email3)) %>% 
  select(-status_sender_email3) #drop the variable that is no longer needed

#verification the merged work
df_message_status %>% filter(!is.na(status_sender)) %>% count()
##        n
## 1 296325

With this operation we attached an employee status to the sender in 296,325 rows. Next we do the same for the recipient.
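The logic of the two-step merge can be checked on a small example. The sketch below uses hypothetical toy tables and column names (status_main, status_alt): the status is looked up first through the main address, then through the alternative one, and coalesce() keeps the first non-missing value.

```r
library(dplyr)

# Toy tables (hypothetical addresses and statuses)
toy_status <- tibble(
  Email_id = c("ann.lee@example.com", "bob.ray@example.com"),
  Email3   = c("a.lee@example.com", NA),
  status   = c("Director", "Trader")
)
toy_messages <- tibble(
  sender = c("ann.lee@example.com", "a.lee@example.com", "outsider@example.com")
)

toy_with_status <- toy_messages %>%
  # first lookup: main address
  left_join(select(toy_status, Email_id, status_main = status),
            by = join_by(sender == Email_id)) %>%
  # second lookup: alternative address, ignoring employees without one
  left_join(toy_status %>%
              filter(!is.na(Email3)) %>%
              select(Email3, status_alt = status),
            by = join_by(sender == Email3)) %>%
  # keep the first non-missing status
  mutate(status_sender = coalesce(status_main, status_alt)) %>%
  select(sender, status_sender)
```

Senders that match neither address, such as the outsider here, keep an NA status, which matches the choice made above to keep the NA values.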

First, we do it for the recipient with Email_id

#prepared the employeelist table for the merge
employee_merge_final_recipient <- employeelist_2 %>% 
  select(Email_id, status) %>% #keep only the variables we need
  mutate(status_recipient = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merged with the df_message table 
df_message_status <- left_join(df_message_status, employee_merge_final_recipient, 
                               join_by(rvalue == Email_id))

#verification the merged work
df_message_status %>% filter(!is.na(status_recipient)) %>% count()
##        n
## 1 291737

Then we do it for the recipient with Email3

#prepared the employeelist table for the merge
employee_merge_final_recipient2 <- employeelist_2 %>% 
  select(Email3, status) %>% #keep only the variables we need
  mutate(status_recipient_email3 = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merged with the df_message table 
df_message_status <- left_join(df_message_status, employee_merge_final_recipient2, 
                               join_by(rvalue == Email3))

#verification the merged work
df_message_status %>% filter(!is.na(status_recipient_email3)) %>% count()
##      n
## 1 2382

Group all the recipient statuses into one variable

df_message_status <- df_message_status %>% 
  #fill the NA values of status_recipient with the status found through Email3
  mutate(status_recipient = coalesce(status_recipient, status_recipient_email3)) %>% 
  select(-status_recipient_email3) #drop the variable that is no longer needed

#verification the merged work
df_message_status %>% filter(!is.na(status_recipient)) %>% count()
##        n
## 1 294119

By doing this, we identified the status of the recipient in 294,119 rows.

Now that all the information we need is grouped in the same data frame, we look at the period covered by the email content in the reference variable.

start <- df_message %>% filter(!is.na(reference)) %>% select(date) %>%
  arrange(date) %>% head(n=1)


end <- df_message %>% filter(!is.na(reference)) %>% select(date) %>%
  arrange(desc(date)) %>% head(n=1)

length_email_content <- df_message %>% filter(!is.na(reference)) %>% count()

We have 268524 messages with content. The first was sent on 1999-05-07 and the last on 2002-07-12. We will therefore be able to analyse the content of part of the messages exchanged between Enron employees over this period.

To facilitate the analysis and lighten the data frame, we drop the identifier columns that are no longer useful and rename the rvalue variable to recipient, which is more meaningful.
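
The start date, the end date, and the number of messages with content can also be obtained in a single pass. The sketch below uses a hypothetical toy table with the same `date` and `reference` column names as `df_message`.

```r
library(dplyr)

# Hypothetical toy table with the same column names as df_message
toy_messages <- tibble(
  date      = as.Date(c("1999-05-07", "2001-10-15", "2002-07-12", "2000-01-01")),
  reference = c("text", NA, "text", "text")
)

period <- toy_messages %>%
  filter(!is.na(reference)) %>%          # keep messages that have content
  summarise(start = min(date),           # earliest message
            end = max(date),             # latest message
            n_messages = n())            # number of messages with content

period
```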

df_message_status <- df_message_status %>% 
  #drop the identifier variables
  select(-c(mid, rfid, rid)) %>%
  #rename the recipient email variable and remove any spaces the email addresses could contain
  mutate(recipient = gsub(" ", "", rvalue),
         sender = gsub(" ", "", sender)) %>%
  #order the variables
  select(date, sender, status_sender, rtype, recipient, status_recipient, subject, reference)
#remove the objects that are no longer needed from the environment
rm(employeelist, message, message_2, recipientinfo, recipientinfo_2, referenceinfo, referenceinfo_2, df_message_missing, message_merge, recipient_merge, EmailID_sender1, EmailID_sender2, EmailID_sender3, EmailID_sender4, EmailID_recipient1, EmailID_recipient2, EmailID_recipient3, EmailID_recipient4, employee_merge1, employee_merge2, employee_merge3, employee_merge4, end, start, length_email_content, employee_merge_final, employee_merge_final2, employee_merge_final_recipient, employee_merge_final_recipient2, dim_employee, dim_message, dim_recipient, dim_reference)

Data analysis

#in this part we will draw many plots; all of them will use the same theme
theme_set(theme_light())

We start by drawing a global picture of the cleaned data.

Emailcount <- count(df_message_status %>% filter(rtype == "TO") %>% distinct(sender, recipient, subject, reference))
Reply <- count(df_message_status %>% filter(str_detect(subject, "^RE:")) %>% distinct(sender, recipient, subject, reference))
emailExchangeStatus <- count(df_message_status %>% distinct(sender, status_sender, recipient, status_recipient, subject, reference) %>% filter(!is.na(status_sender)|!is.na(status_recipient)))

In this data set, we have 17501 senders and 67571 recipients. The high difference between the number of senders and recipients suggests that an email involved several people. We have 908151 different direct email exchanges where 9.82 % are replies to former emails. This suggests that most of the emails are information sent or received, with few being real exchanges between workers. Perhaps at that time, workers communicated through other means, such as the telephone. Moreover, among the total email exchanges, only for 31.44 % do we know the status of the sender or the recipient in Enron, suggesting that there are a lot of emails from external sources and/or workers with unidentified statuses. It is also possible that some emails are addressed to email lists that group several employees in the company. For those, we can’t determine the status of the workers.
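
The reply share depends on how the "RE:" prefix is detected. As a sketch on hypothetical subjects, the comparison below shows that a case-sensitive pattern such as `"^RE:"` can miss replies written as "Re:" or "re:". Whether this matters for the real data is an assumption that would need to be checked.

```r
library(dplyr)
library(stringr)

# Hypothetical subjects, not taken from the Enron data
toy <- tibble(subject = c("RE: meeting", "Re: lunch", "FW: report", "Budget", "re: gas"))

reply_share <- toy %>%
  summarise(
    # case-sensitive detection, as in the pattern "^RE:"
    strict  = mean(str_detect(subject, "^RE:")),
    # case-insensitive detection
    relaxed = mean(str_detect(subject, regex("^re:", ignore_case = TRUE)))
  )

reply_share
```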

enronEmailAdd <- count(df_message_status %>% filter((str_detect(sender,"@enron")) | (str_detect(recipient,"@enron"))) %>% distinct(sender, recipient))
#keywords regularly used in general (shared) email address names, seen in the sender or recipient variable
general_address_pattern <- "^enron|^press|^office|^all|^announcement|^communications|affair|client|contact|secur|team|comit|^west|energy"
Estimation_generalEmailAdd <- count(df_message_status %>% 
                                      filter(str_detect(sender, general_address_pattern) | str_detect(recipient, general_address_pattern)))
Exchange_ext_enron <- count(
  #extract the variables we need
  df_message_status %>% select(date, sender, recipient, subject, reference) %>% 
    #flag the senders and recipients who have an Enron email address
    mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
           count_recipient = if_else(str_detect(recipient, "@enron"), 1, 0)) %>% 
    #for each date and subject, sum the senders and recipients with an Enron email address
    group_by(date, subject) %>% mutate(
      sum_sender = sum(count_sender),
      sum_recipient = sum(count_recipient)) %>% ungroup() %>%
    #isolate the email exchanges that do not involve anyone with an Enron email address
    filter((sum_sender == 0) & (sum_recipient == 0)))

In our data set, 255866 exchanges are sent by or addressed to an Enron email address. Enron had many clients with their own email domains, and these email lists may also contain spam. This could be why only about 30% of the email addresses in those exchanges have an Enron domain. We also estimate that about 63879 of those exchanges are sent from or addressed to a general email address that covers several workers at Enron or at one of its clients. We observed that 25212 exchanges are both sent by and addressed to people without an Enron domain in their email addresses. These exchanges represent about 1% of the total emails in the data set.
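
The internal/external distinction can be illustrated row by row. The sketch below, on hypothetical addresses, tags each exchange as internal, mixed, or external from the `@enron` pattern. It is a simplification: it works per row and does not group by date and subject as the code above does.

```r
library(dplyr)
library(stringr)

# Hypothetical addresses, not taken from the Enron data
toy <- tibble(
  sender    = c("a@enron.com", "b@client.com", "c@enron.com", "d@client.com"),
  recipient = c("x@enron.com", "y@enron.com", "z@client.com", "w@other.com")
)

toy_tagged <- toy %>%
  mutate(
    sender_internal    = str_detect(sender, "@enron"),
    recipient_internal = str_detect(recipient, "@enron"),
    exchange = case_when(
      sender_internal & recipient_internal ~ "internal", # both sides at Enron
      sender_internal | recipient_internal ~ "mixed",    # only one side at Enron
      TRUE ~ "external"                                  # neither side at Enron
    )
  )

exchange_counts <- toy_tagged %>% count(exchange)
exchange_counts
```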

#count the number of email address without enron domain for the sender
c1 <- df_message_status %>% distinct(sender) %>% mutate(
  count_tot_sender = n(),
  count_ext_sender = if_else((!str_detect(sender, "@enron")), 1, 0),
  sum_ext_sender = sum(count_ext_sender),
  pct_ext_sender = paste0(round((sum_ext_sender/count_tot_sender)*100), "%")
  ) %>% distinct(sum_ext_sender, pct_ext_sender)

#count the number of email address without enron domain for the recipient
c2 <- df_message_status %>% distinct(recipient) %>% mutate(
  count_tot_recipient = n(),
  count_ext_recipient = if_else((!str_detect(recipient, "@enron")), 1, 0),
  sum_ext_recipient = sum(count_ext_recipient),
  pct_ext_recipient = paste0(round((sum_ext_recipient/count_tot_recipient)*100), "%")
  ) %>% distinct(sum_ext_recipient, pct_ext_recipient)

#bind both counts in the same data frame
cbind(c1, c2)
##   sum_ext_sender pct_ext_sender sum_ext_recipient pct_ext_recipient
## 1          11457            65%             39313               58%

This highlights that more than half of the senders and recipients do not have an email address with an Enron domain. This suggests that the email exchanges may be more between Enron employees and the company’s clients. It is also possible that the emails are sent to or from personal email addresses of Enron employees, maybe in the case of informal exchanges.

From this initial overview of the data, we can deduce that:

  • The dataset we have is not exhaustive regarding the status of employees in the company as well as the content of the emails.

  • A lot of exchanges are conducted with external persons. But most of the exchanges involve Enron employees where less than 10% of the emails are sent to or from addresses without an Enron domain.

  • It seems that few emails are real exchanges between employees, as we have few emails containing “RE:” in their subject.

  • A small part of the email exchanges seems to be between people who are external to the Enron company. Although they represent a negligible part of the total dataset, we will keep them in the dataset for further analysis.

Given this, we decide to include the employees without status to avoid losing any information about the email exchanges and to keep the external email addresses for the analysis.

The employee list

To explore the number of employees per status, we use the employeelist_2 data frame, which contains the email addresses, the name, and the status of each Enron worker.

Number of employee per status :

employeelist_2 %>% select(status) %>% #select the needed variable
  group_by(status) %>% count() %>% #count the number of employee per status
  ungroup() %>%
  #calculate the percentage for each status
  mutate(perc = `n`/sum(`n`),
  labels = scales::percent(perc)) %>%
  #bar chart
  ggplot(aes(reorder(status, perc ,sum),perc, fill = status)) +
  geom_bar(stat = "identity") +
  #to invert the axis's position
  coord_flip()+ 
  #customize the theme, title and axis labels
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+
  ggtitle("Percentage of employees in the employee list per status")+
  labs(y = "Percentage (%)",x = "Employee status") +
  scale_fill_brewer(palette = "Set3", 
                    #to display the NA in grey on the graph
                    na.value = "grey50"
                    )+
  theme(legend.position = "none")

The above bar chart shows us that:

  • Most of the employees have an ‘employee’ or ‘unknown’ status (27.48% and 21.48% respectively).

  • There are few lawyers (less than 1% of the total number of employees).

  • Surprisingly, a lot of employees have a ‘vice president’ status (about 15%).

  • There is a similar number of managers, directors, and traders in the company (about 9% for each).

  • At the head of the company, there are several CEOs, Presidents, and Managing Directors (about 2% for each).

After that, we look at the email exchanges over the study period. First, we extract the month and the year from the date and store them in separate variables.

df_message_status <- df_message_status %>% 
  mutate(year = format(date,"%Y"), #extract the year from the date
         month = format(date, "%m")) %>% #extract the month from the date 
  transform( #to put the variables in the right type
    year = as.factor(year),
    month = as.factor(month))
df_message_status %>% group_by(year,month)%>%
  count() %>%
  ggplot(aes(month, n, group = year, color = year))+
  geom_line(size = 1)+
  scale_y_continuous(labels = scales::label_comma())+
  labs(title = "Number of email sent/received per month by the Enron's worker",
       x = "Month",
       y = "Number of emails")+
  scale_color_brewer(palette = "Set3")

The above plot shows that:

  • For the year 1999, the email exchange is low. We find the same rate in April 2002.

  • Over the year 2000, the number of emails exchanged between Enron’s workers increased gradually, reaching its highest level in November 2000.

  • In the year 2001, we see a peak of email exchanges during April and May. This period in 2001 is when the fiscal fraud began to be discovered. Then, the number of exchanges decreased during the summer, only to peak again in October, which is also the period when the company was under SEC investigation.

  • The email exchanges stopped in May 2002, possibly the date when the company was completely closed. At the start of 2002 (in January and February), we still see a high number of email exchanges. This may be due to the completion of the fiscal fraud investigation and its consequences for the company.
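
The monthly counts behind such a plot can be sketched on hypothetical dates as follows. `count()` with a `name` argument returns one row per year and month, which is the shape a line plot needs.

```r
library(dplyr)

# Hypothetical dates, not taken from the Enron data
toy <- tibble(date = as.Date(c("2000-11-03", "2000-11-20", "2001-04-02",
                               "2001-10-09", "2001-10-30")))

monthly <- toy %>%
  mutate(year = format(date, "%Y"),    # extract the year
         month = format(date, "%m")) %>% # extract the month
  count(year, month, name = "n_emails")  # one row per year and month

monthly
```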

Description of the number of emails sent and received

First of all, in the df_message_status table, we count the distinct email addresses of the senders and of the recipients:

#count the number of distinct sender email addresses
sender_count <- df_message_status %>% select(sender) %>% #keep only the variable we need
  distinct(sender) %>% #keep each email address only once
  count() #count them
#count the number of distinct recipient email addresses
recipient_count <- df_message_status %>% select(recipient) %>% distinct(recipient) %>% count()

In the df_message_status table, there are 67571 different recipient email addresses and 17501 different sender email addresses. The large difference between them suggests that one email is often addressed to several people.

To see which type of Enron worker is the most active in the email exchanges, we look at the number of emails sent and received by each status and then compare them.

We start with the emails sent.

#compute the number of emails sent per day per employee status
violin_worker <- df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>%
  summarise(email_count = n(), .groups = "drop")

#violin plot 
ggplot(violin_worker, aes(as.factor(status_sender), email_count, fill = as.factor(status_sender))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  ylim(c(0,250))+
  stat_compare_means(method = "anova", label.y = 250, size = 4)+
  labs(title = "Number of emails sent based on the status",
       x = "Source",
       y = "Number of emails") +
  theme(legend.position = "none")

The above plot shows that employees send the highest number of emails in the company. The ANOVA test shows that the difference between the groups is significant.

Table with the descriptive statistics for each group:

#descriptive statistics between the worker status group
violin_worker %>% group_by(status_sender)%>%
  summarise(
    mean = mean(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 9 × 7
##   status_sender       mean     sd   min    Q1    Q3   max
##   <fct>              <dbl>  <dbl> <int> <dbl> <dbl> <int>
## 1 CEO                37.7  284.       1     3  17    4740
## 2 Director           27.7   41.4      1     3  39     298
## 3 Employee          159.   271.       1    13 186.   4085
## 4 In House Lawyer     7.29   7.12     1     2   9      35
## 5 Manager            47.9   69.0      1    11  62    1044
## 6 Managing Director  10.7   32.2      1     2   8     455
## 7 President          29.6   75.5      1     3  26     988
## 8 Trader             17.6   24.0      1     4  23     307
## 9 Vice President     74.5  116.       1    12  89.8  1014
#statistical comparison between group
pairwise.t.test(violin_worker$email_count, violin_worker$status_sender, 
                #adjust the p-values with Bonferroni because the number of groups is small
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  violin_worker$email_count and violin_worker$status_sender 
## 
##                   CEO     Director Employee In House Lawyer Manager
## Director          1.000   -        -        -               -      
## Employee          < 2e-16 < 2e-16  -        -               -      
## In House Lawyer   1.000   1.000    < 2e-16  -               -      
## Manager           1.000   1.000    < 2e-16  0.154           -      
## Managing Director 1.000   1.000    < 2e-16  1.000           0.017  
## President         1.000   1.000    < 2e-16  1.000           1.000  
## Trader            1.000   1.000    < 2e-16  1.000           0.032  
## Vice President    0.022   7.0e-05  < 2e-16  5.7e-05         0.047  
##                   Managing Director President Trader 
## Director          -                 -         -      
## Employee          -                 -         -      
## In House Lawyer   -                 -         -      
## Manager           -                 -         -      
## Managing Director -                 -         -      
## President         1.000             -         -      
## Trader            1.000             1.000     -      
## Vice President    2.5e-08           8.3e-05   2.9e-09
## 
## P value adjustment method: bonferroni

The tables above describe the number of emails sent per day for each status and compare each group. This confirms the first observations shown in the violin plot, where:

  • Employees are the group that sends the highest number of emails per day on average. Employees are also the largest group of workers in the company, which may influence this result.

  • After them, vice presidents and managers send the highest number of emails per day. This may be related to their roles in the company.
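
The Bonferroni adjustment used above multiplies each raw p-value by the number of comparisons and caps the result at 1. A minimal sketch with hypothetical p-values (not taken from the tables above):

```r
# Hypothetical raw p-values for four comparisons
p_raw <- c(0.001, 0.02, 0.04, 0.5)

# Bonferroni: multiply by the number of comparisons, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# With 9 status groups, the number of pairwise comparisons is choose(9, 2)
n_comparisons <- choose(9, 2)

p_bonf
n_comparisons
```

With this many comparisons the correction is conservative, so a pair of groups needs a very small raw p-value to remain significant.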

Previously, we pointed out that employees are the largest group in Enron. To check whether they are really the most active group in terms of email sending, we normalize the number of emails sent per day for each group by the number of Enron workers who sent emails in that group on that day.

#filter to keep only the workers with a known status
df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>% 
  #count the number of emails sent per day per group as well as the distinct number of workers in each group at this date
  mutate(
    nb_send = n(), #for each status, count the total number of emails sent on a date
    nb_sender_per_gp = n_distinct(sender) #for each status, count the number of distinct sender email addresses on a date
  ) %>% ungroup()%>% 
  #compute the ratio between the emails sent per day for each status and the number of distinct senders in that status for that day
  mutate(ratio_nb_email = nb_send/nb_sender_per_gp) %>%
  #violin box plot
  ggplot(aes(status_sender, ratio_nb_email, fill = status_sender)) +
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  stat_compare_means(method = "anova", label.y = 2500, size = 4)+
  labs(title = "Number of emails sent based on the status",
       subtitle = "Ratio to the number of workers per group.",
       x = "Source",
       y = "Ratio\n(number of emails per status/number of workers per status)")+
  theme(legend.position = "none")

After normalizing the number of emails sent per day, the values are generally close to zero, with a first quartile probably between 0 and 10. Surprisingly, the CEO group now shows the highest average number of emails per day, which contradicts our previous observation on the raw number of emails sent per day by worker status. The violin plot suggests a large gap between the lowest and highest numbers of emails sent per day for this group, so the average might be pulled up by a few extreme values.
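
The ratio itself can be illustrated on a hypothetical toy table. This sketch uses `summarise()` and therefore returns one row per date and status, whereas the report's code keeps one row per sender within each group, so the two summaries are weighted differently.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df_message_status
toy <- tibble(
  date          = as.Date(c("2001-08-23", "2001-08-23", "2001-08-23",
                            "2001-08-23", "2001-08-24")),
  status_sender = c("CEO", "CEO", "CEO", "Employee", "CEO"),
  sender        = c("k@enron.com", "k@enron.com", "d@enron.com",
                    "e@enron.com", "k@enron.com")
)

ratio_per_day <- toy %>%
  group_by(date, status_sender) %>%
  summarise(nb_send = n(),                         # emails sent that day by the group
            nb_sender_per_gp = n_distinct(sender), # distinct senders that day in the group
            .groups = "drop") %>%
  mutate(ratio_nb_email = nb_send / nb_sender_per_gp) # emails per active sender

ratio_per_day
```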

#description of the emails sent for each status
df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>% mutate(
    #count the number of emails sent in each group
    nb_send = n(),
    #count the number of distinct senders in each group
    nb_sender_per_gp = n_distinct(sender)) %>% 
  ungroup()%>% 
  #compute the ratio
  mutate(ratio_nb_email = nb_send/nb_sender_per_gp) %>%
  distinct(date,status_sender, sender, nb_send, nb_sender_per_gp, ratio_nb_email) %>% 
  group_by(status_sender)%>% 
  #description of the emails sent, normalized by the number of distinct senders in each status
  summarise(
    mean = mean(ratio_nb_email),
    median = median(ratio_nb_email),
    sd = sd(ratio_nb_email),
    min = min(ratio_nb_email),
    Q1 = quantile(ratio_nb_email, 0.25),
    Q3 = quantile(ratio_nb_email, 0.75),
    max = max(ratio_nb_email)
  )
## # A tibble: 9 × 8
##   status_sender      mean median     sd   min    Q1    Q3   max
##   <fct>             <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CEO               32.0    7    189.       1  3    15    2370 
## 2 Director          12.0    7     15.6      1  3    14.5   194 
## 3 Employee          23.7   16.1   25.9      1 10.7  25.7   348 
## 4 In House Lawyer    7.29   5      7.12     1  2     9      35 
## 5 Manager           11.3    8.43  13.0      1  5.17 13.2   201.
## 6 Managing Director  9.96   3.5   25.7      1  2     7.5   228.
## 7 President         20.3    9     59.0      1  3    18     988 
## 8 Trader             7.59   5      8.03     1  2.67  9.12   81 
## 9 Vice President    15.3   11.2   14.9      1  6.8  18.3   206

After normalizing the number of emails sent by the number of senders in the group, the average for the CEO group is around 32 emails per day with a median of 7, while the average for the employees is around 24 with a median of 16. This suggests that the CEO average is pushed up by a few extreme values. Indeed, the maximum for the CEO group is 2,370, against 348 for the employees. This could be why the CEO group appears to send more emails per day.

To understand where this extreme value comes from, we look closely at the CEO group and isolate the day on which the ratio reaches its maximum of 2,370.

df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>% mutate(
  nb_send = n(),
  nb_sender_per_gp = n_distinct(sender)) %>% ungroup()%>% 
  mutate(ratio_nb_email_pctg = nb_send/nb_sender_per_gp) %>%
  distinct(date,status_sender, sender, nb_send, nb_sender_per_gp, ratio_nb_email_pctg) %>% 
  #look especially to the CEO status
  filter((status_sender == "CEO") & (ratio_nb_email_pctg == 2370))
## # A tibble: 2 × 6
##   date       status_sender sender   nb_send nb_sender_per_gp ratio_nb_email_pctg
##   <date>     <fct>         <chr>      <int>            <int>               <dbl>
## 1 2001-08-23 CEO           kenneth…    4740                2                2370
## 2 2001-08-23 CEO           david.w…    4740                2                2370

Indeed, the maximum number of emails sent by the CEO group was reached on 23 August 2001, the period when the heads of the company started to worry that the fiscal fraud could be discovered by the authorities.
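
The gap between the mean and the median of the CEO group can be reproduced with a small hypothetical vector: a single extreme day is enough to pull the mean far above the median.

```r
# Hypothetical daily ratios with one extreme value
daily_ratio <- c(3, 5, 7, 9, 15, 2370)

ratio_mean   <- mean(daily_ratio)   # sensitive to the extreme value
ratio_median <- median(daily_ratio) # robust to the extreme value

c(mean = ratio_mean, median = ratio_median)
```

This is why the median and the quartiles are more informative than the mean for comparing the groups here.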

#environment cleaning
rm(jeff_stat, sender_stat, statuts_stat, p1, p2, p3, p4, violin_plot, violin_plot1, violin_plot2, violin_worker)

Now we look at the emails received by each Enron worker status.

#compute the number of emails received per day per employee status
violin_worker <- df_message_status %>%   filter(!is.na(status_recipient)) %>%
  group_by(date, status_recipient) %>%
  summarise(email_count = n(), .groups = "drop")

#violin plot 
ggplot(violin_worker, aes(as.factor(status_recipient), email_count, fill = as.factor(status_recipient))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  ylim(c(0,250))+
  stat_compare_means(method = "anova", label.y = 250, size = 4)+
  labs(title = "Number of emails received based on the status",
       x = "Source",
       y = "Number of emails") +
  theme(legend.position = "none")

Employees, managers, and vice presidents seem to be the groups of workers at Enron who receive the highest number of emails. The in-house lawyers seem to receive the fewest emails per day. The difference between the groups is significant.

Descriptive statistics and comparison between groups:

#description of the email received by each status
violin_worker %>% group_by(status_recipient)%>%
  summarise(
    mean = mean(email_count),
    median = median(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 9 × 8
##   status_recipient   mean median     sd   min    Q1    Q3   max
##   <fct>             <dbl>  <dbl>  <dbl> <int> <dbl> <dbl> <int>
## 1 CEO               11.6       6  15.3      1     2  15     197
## 2 Director          35.6      18  61.7      1     5  38     676
## 3 Employee          98.6      40 156.       1     7 122.   1333
## 4 In House Lawyer    5.64      3   8.14     1     1   6.5    62
## 5 Manager           42.2      28  53.1      1    10  55     438
## 6 Managing Director 18.0       6  30.4      1     2  18     178
## 7 President         22.9      10  32.4      1     3  29     224
## 8 Trader            39.8      12  70.6      1     3  42     538
## 9 Vice President    85.8      32 130.       1     7 122.   1140
#statistical comparison between group
pairwise.t.test(violin_worker$email_count, violin_worker$status_recipient, 
                #adjust the p-values with Bonferroni because the number of groups is small
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  violin_worker$email_count and violin_worker$status_recipient 
## 
##                   CEO     Director Employee In House Lawyer Manager
## Director          9.4e-05 -        -        -               -      
## Employee          < 2e-16 < 2e-16  -        -               -      
## In House Lawyer   1.00000 5.9e-05  < 2e-16  -               -      
## Manager           2.4e-08 1.00000  < 2e-16  8.7e-08         -      
## Managing Director 1.00000 0.01940  < 2e-16  1.00000         3.3e-05
## President         0.86132 0.35860  < 2e-16  0.18185         0.00190
## Trader            9.8e-07 1.00000  < 2e-16  1.5e-06         1.00000
## Vice President    < 2e-16 < 2e-16  0.06459  < 2e-16         < 2e-16
##                   Managing Director President Trader 
## Director          -                 -         -      
## Employee          -                 -         -      
## In House Lawyer   -                 -         -      
## Manager           -                 -         -      
## Managing Director -                 -         -      
## President         1.00000           -         -      
## Trader            0.00058           0.02020   -      
## Vice President    < 2e-16           < 2e-16   < 2e-16
## 
## P value adjustment method: bonferroni

Again, it is the employees who receive the highest number of emails per day. Their mean is the highest and is close to that of the vice presidents. In addition, the standard deviations of these two groups are large, so their distributions overlap. This explains why the number of emails received per day by the employee group isn’t significantly higher than that of the vice president group (adjusted p = 0.065). The employee group is the largest in the company (27% of the workforce), while the vice presidents represent only about 15% of the workforce. Perhaps they also receive a high number of emails because of their position in the company. The manager group is also one of the groups that receive the highest number of emails per day, perhaps for the same reason. After these groups, we find the traders and directors, who also receive a high number of emails per day.

As for the emails sent, we check whether those results are confirmed when we normalize the number of emails received per day for each group by the number of workers in that group.

#filter to keep only the workers with a known status
df_message_status %>% filter(!is.na(status_recipient)) %>%
  #note: the rows are grouped by sender status here, so every recipient status found in the same (date, sender status) group shares the same ratio; grouping by status_recipient would give one ratio per recipient status
  group_by(date, status_sender) %>% 
  #count the number of emails received per day per group as well as the distinct number of recipients in each group at this date
  mutate(nb_received = n(),
  nb_received_per_gp = n_distinct(recipient)) %>% 
  ungroup()%>% 
  #compute the ratio between the emails received per day in each group and the number of distinct recipients in that group for that day
  mutate(ratio_nb_email = nb_received/nb_received_per_gp) %>%
  #violin box plot
  ggplot(aes(status_recipient, ratio_nb_email, fill = status_recipient)) +
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  stat_compare_means(method = "anova", label.y = 70, size = 4)+
  labs(title = "Number of emails received based on the status",
       subtitle = "Ratio to the number of workers per group.",
       x = "Source",
       y = "Ratio\n(number of emails per status/number of workers per status)")+
  theme(legend.position = "none")

#description of the emails received by each status, normalized by the number of distinct recipients (same grouping by date and sender status as in the plot)
df_message_status %>% filter(!is.na(status_recipient)) %>%
  group_by(date, status_sender) %>% 
  mutate(nb_received = n(),
  nb_received_per_gp = n_distinct(recipient)) %>% 
  ungroup()%>% 
  mutate(ratio_nb_email = nb_received/nb_received_per_gp)%>%
  #keep only distinct value
  distinct(date,status_recipient, recipient, nb_received, nb_received_per_gp, ratio_nb_email) %>% 
  #make the descriptive statistics for each recipient group
  group_by(status_recipient)%>% summarise(
    mean = mean(ratio_nb_email),
    median = median(ratio_nb_email),
    sd = sd(ratio_nb_email),
    min = min(ratio_nb_email),
    Q1 = quantile(ratio_nb_email, 0.25),
    Q3 = quantile(ratio_nb_email, 0.75),
    max = max(ratio_nb_email)
  )
## # A tibble: 9 × 8
##   status_recipient   mean median    sd   min    Q1    Q3   max
##   <fct>             <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CEO                5.69   4.35  5.02     1  2.86  6.52  67.8
## 2 Director           6.54   4.81  5.70     1  3.33  7.83  48.7
## 3 Employee           6.19   4.56  5.60     1  3.04  7.13  67.8
## 4 In House Lawyer    6.99   5.32  5.88     1  3.71  8.25  40.9
## 5 Manager            6.26   4.76  5.38     1  3.2   7.29  67.8
## 6 Managing Director  6.35   4.46  6.32     1  2.74  7.28  67.8
## 7 President          5.34   4.12  4.79     1  2.48  6.25  56.1
## 8 Trader             7.17   5.27  6.55     1  3.51  8.43  67.8
## 9 Vice President     5.53   4.17  4.99     1  2.67  6.41  67.8

When we normalize the number of emails received by the number of workers in each group, there is still a significant difference between statuses. However, the differences between groups are less contrasted than for the emails sent. One explanation could be that, each day, more workers receive emails than send them. The significant p-value may also come from the large number of emails, which increases the statistical power and makes significance easier to reach. Note also that the ratio is computed within groups defined by the date and the sender status, so recipient statuses that share a group share the same ratio, which may flatten the differences between them.
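
When the sample is very large, an effect size helps to judge whether a significant difference is also a meaningful one. The sketch below uses simulated data (hypothetical groups "A" and "B", not the Enron data) and computes eta squared, the share of the total variance explained by the grouping, from a one-way ANOVA.

```r
set.seed(1)

# Simulated data: two large groups with a very small difference in means
n_per_group <- 5000
toy <- data.frame(
  status = rep(c("A", "B"), each = n_per_group),
  ratio  = c(rnorm(n_per_group, mean = 6.0, sd = 5),
             rnorm(n_per_group, mean = 6.3, sd = 5))
)

fit <- aov(ratio ~ status, data = toy)

# Sum of squares for the grouping factor (first row) and the residuals (second row)
ss <- summary(fit)[[1]][["Sum Sq"]]

# Eta squared: between-group sum of squares over the total sum of squares
eta_squared <- ss[1] / sum(ss)
eta_squared
```

An eta squared close to zero indicates that the status explains very little of the variability, even if the p-value is small.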

#count the number of distinct senders and recipients per day and per status
send_vs_received <- df_message_status %>% 
  group_by(date, status_sender) %>% 
  mutate(nb_sender_per_group = n_distinct(sender)) %>% ungroup()%>%
  group_by(date, status_recipient) %>% 
  mutate(nb_recipient_per_group = n_distinct(recipient)) %>% ungroup()

send_vs_received <- as.data.frame(send_vs_received)
  
#descriptive statistic for both the sender and recipient
send_vs_received %>% 
  summarise(
    across(c(nb_sender_per_group,nb_recipient_per_group),
           list(mean = ~mean(.x),
                median = ~median(.x),
                sd = ~sd(.x),
                min = ~min(.x),
                Q1 = ~quantile(.x,0.25),
                Q3 = ~quantile(.x,0.75),
                max = ~max(.x))))
##   nb_sender_per_group_mean nb_sender_per_group_median nb_sender_per_group_sd
## 1                 206.8242                        159               185.2247
##   nb_sender_per_group_min nb_sender_per_group_Q1 nb_sender_per_group_Q3
## 1                       1                     80                    281
##   nb_sender_per_group_max nb_recipient_per_group_mean
## 1                    1328                    1248.807
##   nb_recipient_per_group_median nb_recipient_per_group_sd
## 1                          1168                  848.2686
##   nb_recipient_per_group_min nb_recipient_per_group_Q1
## 1                          1                       618
##   nb_recipient_per_group_Q3 nb_recipient_per_group_max
## 1                      1930                       3145
#violin and box plots to visualize the descriptive statistics
p1 <- send_vs_received %>% filter(!is.na(status_sender)) %>%
  ggplot(aes(status_sender, nb_sender_per_group, fill = status_sender))+
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  labs(title = "Number of persons who sent email per status",
       x = "Source",
       y = "Number of persons")+
  theme(legend.position = "none")

p2 <- send_vs_received %>% filter(!is.na(status_recipient)) %>%
  ggplot(aes(status_recipient, nb_recipient_per_group, fill = status_recipient))+
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  labs(title = "Number of persons who received email per status",
       x = "Status",
       y = "Number of persons")+
  theme(legend.position = "none")

p1/p2

On average, more people in a group receive emails each day than send them. This is especially true for the employee, trader, vice president, and director groups.

In general, employees are the most active in email exchanges. When we normalize the number of emails sent by the number of workers per group, employees remain the most active senders, although at some points the CEO group sent a high number of emails, likely in connection with Enron's events. When we normalize the number of emails received by the number of workers in a group, we see no real difference between the groups, which is consistent with more people receiving emails each day than sending them.
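The normalization mentioned above can be sketched as follows. This is a minimal illustration on toy data: the tibbles `emails_sent` and `group_size` and their values are hypothetical and are not taken from the Enron data set.

```r
library(dplyr)

# Toy data (hypothetical): one row per email sent
emails_sent <- tibble(
  status = c("Employee", "Employee", "Employee", "CEO", "CEO", "Trader"),
  sender = c("a", "b", "a", "c", "c", "d")
)

# Hypothetical number of workers per status
group_size <- tibble(
  status = c("Employee", "CEO", "Trader"),
  nb_workers = c(2, 1, 1)
)

# Emails per status divided by the number of workers in that status
emails_per_worker <- emails_sent %>%
  count(status, name = "nb_emails") %>%
  left_join(group_size, by = "status") %>%
  mutate(emails_per_worker = nb_emails / nb_workers)

emails_per_worker
```

Dividing by the group size makes a small group such as the CEO group comparable with a large group such as the employees.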

Next, we look at the flow of emails between the different statuses over the study period to see whether it changed because of the events at Enron. We restrict the analysis to exchanges between workers with a known status in the company and, for each year, draw a chord diagram that shows the email flow between groups.

#for each year, follow the exchanges between groups
per_year <- df_message_status %>% select(date, status_sender, status_recipient) %>%
  filter(!is.na(status_sender) & !is.na(status_recipient)) %>%
  mutate(year = format(date,"%Y"),
         #to improve readability, we group statuses with a similar level of responsibility
         status_sender = case_when(
           status_sender %in% c("Managing Director", "Manager", "Director") ~ "Manager - Director",
           status_sender %in% c("CEO", "Vice President", "President") ~ "CEO - President",
           .default = status_sender),
         status_recipient = case_when(
           status_recipient %in% c("Managing Director", "Manager", "Director") ~ "Manager - Director",
           status_recipient %in% c("CEO", "Vice President", "President") ~ "CEO - President",
           .default = status_recipient)) %>%
  group_by(date,status_sender, status_recipient) %>%
  #count the number of emails exchanged per sender/recipient status pair and per date
  mutate(number_exchange = n()) %>% ungroup() %>%
  distinct(date, status_sender, status_recipient, number_exchange, year)

#for each year we create a data frame with the number of emails exchanged between each pair of statuses
year_1999 <- as.data.frame(per_year %>% filter(year == 1999) %>%
  group_by(status_sender, status_recipient) %>%
    #yearly sum for each pair
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    #keep only the exchanges between different statuses
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

year_2000 <- as.data.frame(per_year %>% filter(year == 2000) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

year_2001 <- as.data.frame(per_year %>% filter(year == 2001) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

year_2002 <- as.data.frame(per_year %>% filter(year == 2002) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

#the color for each status
status_color <- c(
  "Employee" = "pink",
  "CEO - President" = "orange",
  "Trader" = "springgreen3",
  "Manager - Director" = "violetred4",
  "In House Lawyer" = "purple4")

Chord diagram for the year 1999

#the data frame gives the sender status, the recipient status, and the yearly number of emails (sum), which sets the width of each link
#(table() on these deduplicated rows would only count each pair once and ignore sum)
chordDiagram(year_1999, transparency = 0.5, grid.col = status_color)

Chord diagram for the year 2000

chordDiagram(year_2000, transparency = 0.5, grid.col = status_color)

Chord diagram for the year 2001

chordDiagram(year_2001, transparency = 0.5, grid.col = status_color)

Chord diagram for the year 2002

chordDiagram(year_2002, transparency = 0.5, grid.col = status_color)

For the email exchange, we can see that:

  • In 1999, the trader exchanged emails only with employees, but later, they also exchanged with managers/directors and the CEO/president. Surprisingly, it seems the trader never exchanged directly with the in-house lawyer. Perhaps their email exchanges were indirect.

  • In 2002, the in-house lawyer received emails only from the manager/director. During this period, we do not see email exchanges from the in-house lawyer to other company workers with a known status. Perhaps they sent emails to external persons for managing the company’s bankruptcy with the information they received from the manager and director.

  • The in-house lawyer exchanged emails in 2000 only with the manager/director and the CEO/president, but in 2001, they also exchanged with employees. The change in the email flow for the in-house lawyer might be related to the Enron event, where there could have been a need to inform employees about some matters so they could respond to SEC investigations.

This last analysis highlights the changes in the email flow over the study period. Some of these changes could be linked to the events at Enron.
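The yearly totals per pair of statuses are computed four times with the same steps. As a sketch only, the same computation could be wrapped in a small helper. The function name `pair_totals` and the toy data below are hypothetical; the column names follow the `per_year` data frame used above.

```r
library(dplyr)

# Hypothetical helper: total number of emails per sender/recipient status pair for one year
pair_totals <- function(per_year, target_year) {
  per_year %>%
    # keep the requested year and only the exchanges between different statuses
    filter(year == target_year, status_sender != status_recipient) %>%
    # weighted count = sum of number_exchange for each pair
    count(status_sender, status_recipient, wt = number_exchange, name = "sum") %>%
    as.data.frame()
}

# Toy data (hypothetical values)
toy <- tibble(
  year = c("1999", "1999", "1999", "2000"),
  status_sender = c("Trader", "Trader", "Employee", "Trader"),
  status_recipient = c("Employee", "Employee", "Employee", "CEO - President"),
  number_exchange = c(3, 2, 10, 4)
)

years <- c("1999", "2000")
year_totals <- lapply(years, function(y) pair_totals(toy, y))
names(year_totals) <- years

year_totals[["1999"]]
```

Each element of `year_totals` has the three columns (sender status, recipient status, total) expected by a chord diagram drawn from a data frame.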

Number of emails sent and received per month over the years

The data set covers the email exchanges between Enron’s workers from 1999 to 2002. From 1999 to early 2001, the company was in good health. Starting in the middle of 2001, the company’s fraud became public and put the company in trouble. Through the email history, we will look at whether the number of emails sent and received changed over the months in relation to the workers’ status.

We look, month by month and for each year, at which worker statuses are the most active. We start with the emails sent by each status.

#list of status in the Enron company
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

month_label <- c("01" = "January","02" = "February","03" = "March","04" = "April","05" = "May","06" = "June","07" = "July","08" = "August",
               "09" = "September","10" = "October","11" = "November","12" = "December")

month_color <- c("01" = "lightgreen","02" = "lightsalmon4","03" = "lightblue","04" = "greenyellow","05" = "cyan","06" = "darkgreen","07" = "lavender",
               "08" = "plum","09" = "coral","10" = "honeydew4","11" = "hotpink","12" = "indianred")

#initiate the list for the plot
email_send <- list()

#loop building, for each worker status, a bar plot of the number of emails sent per month
for(i in seq(status_list)){
  
  status <- status_list[i]
  
  p <- df_message_status %>% filter(status_sender == status) %>% #take the value in the list
  group_by(year,month)%>%
  count() %>% 
    #bar plot
    ggplot(aes(month, n, fill = month))+
  geom_bar(stat = "identity") +
  facet_grid(~year)+
  labs(title = paste("Email sent per month for each year by the", status),
       y = "Number of emails")+
  scale_fill_manual(
    values = month_color,
    labels = month_label)+
  theme(legend.position = "bottom",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())
  
  email_send[[i]] <- p}

#display the plot create
n <- length(email_send)

plot_per_section <- 3

for(j in seq(1,n,by=plot_per_section)){
  
  plot_on_the_page <- email_send[j:min(j+2, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the page (up to 3) in 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine together the 3 plot and one legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

By looking year by year we can see that:

  • For all statuses, the highest number of emails was sent in 2001, the year in which the company faced the SEC investigation and the bankruptcy began.

  • It is the workers with employee status who send the highest number of emails in the different years. The number of emails sent follows the trend we observed when we look at all Enron’s workers, suggesting that the employees influence the general email exchange number per month in the company. This could be linked to the number of employees in the company. In 2001, the employee group was the one who sent the highest number of emails.

  • The CEO appears in the emails sent from January 2000, which is the moment their role is formally declared in the company. They send a high number of emails compared to directors and managing directors, especially in April, May, October, and November 2001. This may be related to the fiscal fraud investigation.

  • In the year 2001, the number of emails sent by the in-house lawyer is the highest compared to the other years, suggesting they are involved in managing the fiscal fraud investigation inside the company.

  • The traders are the third group who send a high number of emails per month, which is logical given the company’s activity.

Now we look at the emails received according to the status of the Enron workers.

#initiate the list for the plot
email_received <- list()

#loop building, for each worker status, a bar plot of the number of emails received per month
for(i in seq(status_list)){
  
  status <- status_list[i]
  
  p <- df_message_status %>% filter(status_recipient == status) %>% #take the value in the list
  group_by(year,month)%>%
  count() %>% 
    #bar plot
    ggplot(aes(month, n, fill = month))+
  geom_bar(stat = "identity") +
  facet_grid(~year)+
  labs(title = paste("Email received per month for each year by the", status),
       y = "Number of emails")+
  scale_fill_manual(
    values = month_color,
    labels = month_label)+
  theme(legend.position = "bottom",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())
  
  email_received[[i]] <- p}

#display the plot create
n <- length(email_received)

plot_per_section <- 3

for(j in seq(1,n,by=plot_per_section)){
  
  plot_on_the_page <- email_received[j:min(j+2, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the page (up to 3) in 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine together the 3 plot and one legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

The plots above show that:

  • As for the emails sent, all statuses received a large number of emails in 2001. The pattern of the emails received follows the same trend as the emails sent, suggesting that they belong to the same exchanges.

  • Also here it is again the employees who receive the highest number of emails.

  • The traders seem to receive more emails than they send.

  • For the group at the head of the company (CEO, Managing Director, Director, President, and Vice President), the number of emails received follows the Enron’s fiscal fraud event with high spikes in April, May, October, and November of 2001.

  • In 2001, the Vice President group received a lot of emails compared to the other head groups of the company.

  • In 2001, the in-house lawyer group seemed to receive the highest number of emails.

For both the emails sent and received, we see a peak in October and November 2001, which is the period of the SEC investigation. Moreover, all statuses received more emails than they sent, suggesting that one email reaches many people in the company. These graphs suggest that all statuses exchanged a lot of emails, perhaps to manage the SEC investigation inside the company. To understand what kind of exchange these emails relate to, we will look at some topics linked to the events (investigation and bankruptcy) as well as to the business processes of the company.
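The monthly counts behind the sent and received bar plots differ only in the status column used for the filter. A minimal sketch of a shared helper is given below; the function `monthly_counts` and the toy data are hypothetical, and the column names follow those used in the loops above.

```r
library(dplyr)

# Hypothetical helper: number of emails per year and month for one status,
# filtering on either the sender or the recipient status column
monthly_counts <- function(df, status_col, target_status) {
  df %>%
    filter(.data[[status_col]] == target_status) %>%
    count(year, month)
}

# Toy data (hypothetical values)
toy <- tibble(
  year = c("2001", "2001", "2001", "2001"),
  month = c("10", "10", "11", "10"),
  status_sender = c("Trader", "Trader", "Trader", "CEO"),
  status_recipient = c("Employee", "CEO", "Employee", "Employee")
)

sent_by_trader <- monthly_counts(toy, "status_sender", "Trader")
received_by_employee <- monthly_counts(toy, "status_recipient", "Employee")

sent_by_trader
received_by_employee
```

The resulting tables have one row per year and month with the count in column `n`, which is the shape used by the bar plots.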

#environment cleaning
rm(jeff_stat, recipient_stat, statuts_stat, violin_plot, violin_plot1, violin_plot2, violin_worker, p1, p2, send_vs_received)

Analysis of the email subject and content

In our data set, 2063706 rows contain email content, which represents 10% of the rows. The email content is therefore far less complete than the email subject, which is available for every email exchange.

String_var_stat <- df_message_status %>% distinct(reference, subject) %>% mutate(
  emailTextLength = str_count(reference,
                              #specify in regex we want to count the number of word or sequence of character without space between them
                              "\\S+"),
  emailSubjectLength = str_count(subject,"\\S+")) 

summary(String_var_stat)
##   reference              subject       emailTextLength   emailSubjectLength
##  Length:157194      RE:      :  2744   Min.   :    0.0   Min.   : 0.000    
##  Class :character   FW:      :   585   1st Qu.:   71.0   1st Qu.: 3.000    
##  Mode  :character   RE: Hello:    82   Median :  147.0   Median : 4.000    
##                     RE: Hi   :    56   Mean   :  244.9   Mean   : 4.899    
##                              :    52   3rd Qu.:  288.0   3rd Qu.: 6.000    
##                     RE: Lunch:    48   Max.   :10153.0   Max.   :49.000    
##                     (Other)  :153627   NA's   :110536

On average, the email text contains 245 words and the subject about 5 words. 52 subjects are blank, and the most frequent subjects contain only RE: or FW:, in which case the original subject is hidden. This suggests that the most common emails are replies to another email and emails forwarded between workers.
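As a small illustration of how the word counts above are obtained, the pattern `\\S+` matches each run of non-space characters, so `str_count()` returns the number of space-separated tokens. The example subjects below are made up for the illustration.

```r
library(stringr)

# Made-up subjects, including an empty one and a missing one
subjects <- c("RE:", "RE: Lunch", "", "FW: Gas price update", NA)

# One count per subject; an empty string gives 0 and NA stays NA
word_counts <- str_count(subjects, "\\S+")
word_counts

# Mean length, ignoring the missing subject
mean(word_counts, na.rm = TRUE)
```

Note that a token such as "RE:" counts as one word, so a subject made only of a reply prefix has a length of 1.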

To investigate the subject and text of the emails, we created 4 lists of keywords, one per topic, which are searched in the email subject:

  • Emails related to meetings by looking for words such as message, please, email, inform.

  • Emails related to business processes and business legalities such as enron, deal, change, corp, date, america.

  • Emails related to the core business of Enron like gas, power, trade.

  • Emails related to the Enron events (investigation and bankruptcy) such as bankrup, SEC, MTM, investigation, testimony.

These keywords come from the Wikipedia page about the timeline of Enron's downfall. Each word/concept is searched individually in the email content to follow the email exchanges containing it as well as the statuses of the Enron workers involved in those exchanges. The analysis is conducted over the study period to highlight the periods where these topics/keywords are used most by the Enron workers. We then look at whether some worker statuses used them more than others, and finally at some specific Enron workers known to be involved in the Enron events.

Search for the 4 topics in the email subjects and for the keywords in the email content.

#topics list 

topic_meeting <- c("message|origin|pleas|email|thank|attach|file|copi|inform|receiv|thank|all|time|meet|look|week|day|dont|vinc|talk")

topic_business_process <- c("enron|deal|agreement|chang|contract|corp|fax|houston|date|america|risk|analy|confidential|correction")

topic_core_business <- c("market|gas|price|power|company|energy|trade|busi|servic|manag")

topic_enron_event <- c("bankrup|SEC|MTM|fear|losing money|10-K|fears|investigation|phone|fax|document|testimony|witness|deposition")
#build the data set used to measure the frequency of each topic in the email subject and the number of emails containing specific words, focusing on the sender status

email_subject_send <- df_message_status %>% distinct(date, year, month, sender, status_sender, subject, reference) %>%
  mutate(#flag (1/0) the emails that contain at least one word from the list of each topic
    subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
    subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
    subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
    subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
    email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
    email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
    email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
    email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
    #to get the date in year/month
    year_month = as.Date(paste0(year,"-",month,"-01"))) 
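The topic flags rely on `str_detect()` with an alternation pattern. The sketch below, on made-up subjects, shows three behaviours worth keeping in mind when reading the counts: the match is case-sensitive by default, a short keyword such as "all" also matches inside longer words, and a missing text gives NA rather than 0. The word-boundary and `ignore_case` variants are illustrative alternatives, not the patterns used in this analysis.

```r
library(stringr)

# Made-up subjects
subjects <- c("SEC investigation", "second meeting", "Call me", NA)

# Case-sensitive alternation, as in the topic patterns: "SEC" does not match "second"
event_flag <- str_detect(subjects, "bankrup|SEC|investigation")

# A short keyword matches inside longer words: "all" is found in "Call"
meeting_flag <- str_detect(subjects, "all|meet")

# Illustrative variant: require the keyword to start at a word boundary
meeting_flag_boundary <- str_detect(subjects, "\\b(all|meet)")

# Illustrative variant: case-insensitive match of the whole token "sec"
sec_any_case <- str_detect(subjects, regex("sec\\b", ignore_case = TRUE))

data.frame(subjects, event_flag, meeting_flag, meeting_flag_boundary, sec_any_case)
```

The last row stays NA in every column, which is why the sums computed on the email text need `na.rm = TRUE`.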

In the following part, we create plots that represent the email exchanges about specific topics. To give these plots a consistent appearance, we declare a color and a label for each category so that they can be applied to every plot.

#the list of category studied and their related color in each plot
topic_colors <- c("sum_subject_business_process" = "steelblue4",
                  "sum_subject_core_business" = "orchid",
                  "sum_subject_meeting" = "chocolate4",
                  "sum_subject_enron_event" = "yellowgreen",
                  "sum_email_business_process" = "cyan3",
                  "sum_email_core_business" = "plum4",
                  "sum_email_meeting" = "salmon",
                  "sum_email_enron_event" = "springgreen4")



#the list of category and their related label on the plot  
topic_label <- c("sum_subject_business_process" = "Business process email subject",
                 "sum_subject_core_business" = "Core Business email subject",
                 "sum_subject_meeting" = "Meeting email subject",
                 "sum_subject_enron_event" = "Enron Event email subject",
                 "sum_email_business_process" = "Business process email text",
                 "sum_email_core_business" = "Core business email text",
                 "sum_email_meeting" = "Meeting email text",
                 "sum_email_enron_event" = "Enron's event email text")

Because the email text is available for only part of the rows, searching for the keywords in the email text produces many NA values. To compute the number of emails containing those words, we use the parameter na.rm = TRUE, which removes the NA values before computing the sum (equivalent to counting them as 0).
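The effect of `na.rm = TRUE` can be checked on a toy example. The data below are made up: 1 means the keyword was found in the email text, 0 means it was not, and NA means the email has no text.

```r
library(dplyr)

# Made-up flags for two months
toy <- tibble(
  year_month = as.Date(c("2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01")),
  email_enron_event = c(1, NA, 0, NA)
)

monthly <- toy %>%
  group_by(year_month) %>%
  summarise(
    # without na.rm, a single NA makes the whole sum NA
    sum_default = sum(email_enron_event),
    # with na.rm, the NA values are dropped before summing
    sum_na_rm = sum(email_enron_event, na.rm = TRUE),
    # number of emails that actually have a text
    nb_with_text = sum(!is.na(email_enron_event)),
    .groups = "drop"
  )

monthly
```

A month with a sum of 0 can therefore mean either that no email mentioned the topic or that no email text was available, which is why `nb_with_text` is worth reading alongside the sum.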

#compute the sum of each topics for each month of each year study
email_subject_send_graph <- email_subject_send %>% 
  group_by(year_month) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one row per month and per distinct email (subject and reference are kept for the word counts below)
  distinct(year_month, subject, reference, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)



#display the different topic trend in the email subject over the study's period
email_subject_send_graph %>% select(year_month, starts_with("sum_subject_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "topics",
  values_to = "value") %>%
  #line plot
  ggplot(aes(year_month,value, color=topics))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email subject topics",
    title = "Email subject analysis over the study period",
       x = "Study period",
       y = "Number of emails per topic") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[1:4],
    labels = topic_label[1:4])

We can see that:

  • The highest spikes for all topics are from October 2001 to April 2002.

  • The top topic is meetings, followed by the business process and the core business.

  • For the meeting, we have 3 spikes:

    • One between October 2000 and January 2001, maybe to organize the new year and close the past year.

    • One between April and July 2001, which is the period when the head of the company starts to worry about the business process.

    • The highest peak is between October 2001 and January 2002, the period when the fiscal fraud was discovered by the federal agency.

  • For the business process and core topics, we see 2 spikes which follow the last 2 spikes of the meeting topics. This suggests the topic of the meeting concerns the business. We could think those meetings are more related to the business process than the business core.

  • The emails about the Enron event are the fewest, but we can see a peak of the topic from October 2001 to around February 2002. This makes sense with the known event where the company was put in bankruptcy at this period.

For the email subjects, we look at the frequency of each word we searched for.

#the list of words searched in the subject
word_list <- list("message","origin","pleas","email","thank","attach","file","copi","inform","receiv","thank","time","meet",
                  "look","week","dont","vinc","talk","enron","deal","agreement","chang","contract","corp","fax","houston","america",
                  "risk","analy","confidential","correction", "market","gas","price","power","company","energy","trade","busi","servic","manag",
               "bankrup","SEC","MTM","fear", "investigation", "mark-to-market", "10-K", "losing money", "correction", "phone", "fax", "document", "testimony", "deposition", "witness")

#initiate a vector for registering their frequency
word_count <- c()

##iterate over the list and count the number of time we see each word in the list
for(i in seq_along(word_list)){
  
  search <- as.character(word_list[[i]])
  nb <- sum(str_count(email_subject_send_graph$subject, search))
  
  word_count <- c(word_count, nb)
  
}

#draw a wordcloud which represent the words frequency

wordcloud(word_list, word_count, min.freq = 10 ,max.words=length(word_list), col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = "The top words seen in the email subject", col.main = "black",font.main = 2)

To read the word cloud: the most frequent words are in dark blue and have the largest size, while the least frequent words are in light blue and have the smallest size. This word cloud highlights the following:

  • The most frequently seen word in that list is ‘meet,’ which aligns with the fact that most email subjects are in the meeting topic category.

  • Additionally, there are many words related to the business processes at Enron, such as deal, agreement, change, and contract.

  • The smaller words are linked to the Enron event, such as bankruptcy, MTM, and SEC. This suggests that the email exchanges are not explicitly about the Enron event. We may find more related content within the email bodies.

#display the different topic trends in the email text over the study period
email_subject_send_graph %>% select(year_month, starts_with("sum_email_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "email",
  values_to = "value") %>%
  #line plot
  ggplot(aes(year_month,value, color=email))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email topics text",
    title = "Email text analysis over the study period",
       x = "Study period",
       y = "Number of emails per topic") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[5:8],
    labels = topic_label[5:8])

In the email content, we can see that:

  • For all topics investigated, we find a peak of emails containing them from April 2001 to April 2002, which is related to a peak in email exchange as we saw earlier in this analysis. Additionally, this period is when the company was under SEC investigation and, in late 2001/early 2002, the bankruptcy process.

  • The emails mostly contain words about meetings. Then we find words related to business processes. Surprisingly, we don’t find many emails containing words linked with the Enron event. This may suggest that the Enron events were communicated through other means such as fax and phone calls.

As for the subject, we can look at the frequency of each word in the email text:

#reduce the dataset to the row which contain email text
df_reference <- filter(email_subject_send_graph, !is.na(reference))

#initiate the list for storing the count for each words
email_words_freq <- c()

#loop counting, for each word, the number of email texts in which it is found
for(i in seq_along(word_list)){
  
  word <- as.character(word_list[[i]])
  
  #str_detect returns TRUE for each email text that contains the word, so the sum is the number of emails containing it
  #(str_locate returns a start and an end position per email, which would count each matching email twice)
  nb <- sum(str_detect(df_reference$reference, word), na.rm = TRUE)
  
  #store the frequency of each word in the email text
  email_words_freq <- c(email_words_freq, nb)
  
}

#draw the wordcloud with the frequency of each word
wordcloud(word_list, email_words_freq, min.freq = 10 ,max.words=length(word_list), col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = "The top words seen in the email text", col.main = "black",font.main = 2)

This word cloud is read like the one for the email subject. It shows that:

  • The top words are enron and please, which are related to the Enron business process and to meetings.

  • The next most frequent words are related to meetings (attach, inform, receiv). Then we find words linked with the business process, such as contract, chang, and confidential. The words fax and phone also appear often, which suggests that the emails refer to phone calls or faxes and that workers communicated a lot through these channels at the time.
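To make the meaning of "frequency" explicit, the sketch below counts, for each keyword, the number of email texts that contain it at least once. The texts and keywords are made up, and `fixed()` is used here as an assumption so that each keyword is matched literally rather than as a regular expression.

```r
library(stringr)

# Made-up email texts, one of them missing
texts <- c("please fax the contract", "enron contract attached", "phone me, please", NA)
words <- c("please", "contract", "fax", "bankrup")

# Number of emails containing each word (missing texts are ignored)
emails_with_word <- vapply(
  words,
  function(w) sum(str_detect(texts, fixed(w)), na.rm = TRUE),
  integer(1)
)

emails_with_word
```

Using `str_count()` instead of `str_detect()` would give the total number of occurrences, so a long email repeating the same word would weigh more.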

Then we look at the number of emails received about those topics during the study period.

email_subject_rec <- df_message_status %>% distinct(date, year, month, recipient, status_recipient, subject, reference) %>%
  mutate(#flag (1/0) the emails that contain at least one word from the list of each topic
    subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
    subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
    subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
    subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
    email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
    email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
    email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
    email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
    #to get the date in year/month
    year_month = as.Date(paste0(year,"-",month,"-01"))) 
#compute the sum of each topics for each month of each year study
email_subject_rec_graph <- email_subject_rec %>% 
  group_by(year_month) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one row per month and per distinct email (subject and reference)
  distinct(year_month, subject, reference, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)



#display the different topic trend in the email subject over the study's period
email_subject_rec_graph %>% select(year_month, starts_with("sum_subject_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "topics",
  values_to = "value") %>%
  #line plot
  ggplot(aes(year_month,value, color=topics))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email subject topics",
    title = "Email received subject analysis over the study period",
       x = "Study period",
       y = "Number of emails per topic") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[1:4],
    labels = topic_label[1:4])

For the subjects of the emails received, we distinguish two spikes for each topic: the first from July 2000 to July 2001 and the second from August 2001 to April 2002. These 2 spikes cover the 3 spikes seen for the emails sent. For the topics, we see the same pattern as for the emails sent.

#display the different topic trends in the text of the emails received over the study period
email_subject_rec_graph %>% select(year_month, starts_with("sum_email_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "email",
  values_to = "value") %>%
  #line plot
  ggplot(aes(year_month,value, color=email))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email text topics",
    title = "Email received text analysis over the study period",
       x = "Study period",
       y = "Number of emails per topic") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[5:8],
    labels = topic_label[5:8])

For the emails received about those topics/keywords, we see a pattern similar to that of the emails sent, suggesting that they belong to the same exchanges.
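The monthly totals plotted above are computed with one `sum()` per topic. As an alternative sketch only, the same totals can be written with `across()` and reshaped in one step. The toy flags below are made up, and only three of the eight flag columns are included to keep the example short.

```r
library(dplyr)
library(tidyr)

# Made-up topic flags (NA = no email text available)
toy <- tibble(
  year_month = as.Date(c("2001-10-01", "2001-10-01", "2001-11-01")),
  subject_meeting = c(1, 0, 1),
  subject_enron_event = c(0, 1, 1),
  email_meeting = c(1, NA, NA)
)

monthly_topics <- toy %>%
  group_by(year_month) %>%
  summarise(
    # one monthly sum per flag column, named sum_<column>
    across(c(starts_with("subject_"), starts_with("email_")),
           ~ sum(.x, na.rm = TRUE),
           .names = "sum_{.col}"),
    .groups = "drop"
  ) %>%
  # long format: one row per month and topic, ready for ggplot
  pivot_longer(-year_month, names_to = "topics", values_to = "value")

monthly_topics
```

With `summarise()`, the result has exactly one row per month and topic, so no `distinct()` step is needed before plotting.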

To go deeper in the email content analysis we next look at the topics and key words find in function of the worker status.

For that we create a similar data frame than the previous but by making the count of topics/email in function of the employee status.

status_email_subject_send <- email_subject_send %>% 
  #keep only the workers whose status is known
  filter(!is.na(status_sender)) %>%
  #compute the sum of each topic per year-month and status
  group_by(year_month, status_sender) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one row per year-month and status
  distinct(year_month, status_sender, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)

#pivot the data frame
status_email_subject_send <- status_email_subject_send %>%
  pivot_longer(
    cols = 3:length(status_email_subject_send),
    names_to = "topic_email",
    values_to = "value")
#the list of status in the Enron company
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

#initiate the list to collect the plot
plot_list <- list()

#generating individual plot for each status
for(i in seq_along(status_list)){
  #assign the status to the variable
  status <- status_list[i]
  
  #the plot related to that status
  p <- status_email_subject_send %>% filter(status_sender == status) %>%
         ggplot(aes(year_month, value, color = topic_email))+
         geom_line(size = 1)+
         labs(color = "Email topics (subject & text)",
           title = paste0("Email sent by ", status, ", subject and text analysis"),
           y = "Number of emails per topic",
           x = "Study period")+
      scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
         scale_color_manual(values = topic_colors,
                    labels = topic_label)+
        theme(legend.text.position = "bottom")
  
  #append the plot list
  plot_list[[i]] <- p
}


#display the plot created
n <- length(plot_list)

#number of plot per layout
plot_per_section <- 3

#loop create plot layouts
for (i in seq(1, n, by=plot_per_section)){
  
  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of this layout in 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine the plots of this layout with one shared legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}
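
The paging logic of the loop above can be checked in isolation. The following is a minimal sketch in base R; `page_indices()` is a hypothetical helper (it is not part of the original analysis) that returns which plot indices land on each layout for a given number of plots per layout. It assumes at least one plot.

```r
# Hypothetical helper: split the indices 1..n into consecutive pages
# holding at most `per_page` items each (assumes n >= 1).
page_indices <- function(n, per_page) {
  starts <- seq(1, n, by = per_page)
  lapply(starts, function(i) i:min(i + per_page - 1, n))
}

# With 9 statuses and 3 plots per layout, every page is full
pages <- page_indices(9, 3)
print(pages)

# With 10 plots, the last page holds the single remaining plot
print(page_indices(10, 3)[[4]])
```

Deriving the upper bound from the page size, rather than hard-coding an offset, keeps the slices correct if the number of plots per layout changes.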

By analyzing the email subjects and the email content by Enron worker status, we can see that:

  • Every status shows a peak of emails about those topics from April 2001 to January 2002. Also, the top topic for all is the meeting, followed by the business process. Moreover, the trend we see in the email text is similar to the one in the email subjects.

  • The pattern of the emails sent by the employees follows the topics we saw for Enron’s workers previously. After emails about meetings, we see an important number of emails about the business process, with fewer about the core business. This could be linked with the investigation where employees send emails about the process they are involved in.

  • For the in-house lawyers, we can see two spikes of emails in 2001 regarding meetings and business processes. The first is from February 2001 to July 2001, and the second is from August 2001 to November 2001. The second period overlaps the first internal worries about the accounting practices (August 2001) and the start of the SEC investigation (October 2001), so these emails may be for managing the investigation. The first period precedes both events, so its link with the investigation is less clear.

  • For the managing director, before June 2001, we can’t really distinguish any top topic in the email content and subject. After that, and until December 2001, we have a peak of emails talking about meetings, business processes, and core business. Here, both business topics seem to be at the same level. We see a similar tendency for the manager. We can think that, during this period, the managers have a lot of meetings to manage both sides of Enron’s businesses.

  • The traders send a significant number of emails about the core business and processes from July 2001 to March 2002. They speak a little about the Enron event.

  • Surprisingly, the CEO shows a significant peak of emails related to meetings, core business, and processes from December 2000 to May 2001, and then from November 2001 to January 2002. We can see a slight peak of emails speaking about the Enron event during these two periods, but the count for them is less than other statuses. This suggests they are not really involved in the email exchange during the SEC investigation, or less so than other Enron worker statuses. Perhaps, the email text we have isn’t exhaustive; maybe the emails about those events aren’t public, or most of this communication by the CEO is managed by other means such as phone calls and fax.

  • For other statuses at the head of the company (President and Vice-president), we can see that we have a peak of emails at the end of 2001 and the start of 2002. The highest peak, after the meeting topic, is linked to the business topics. Additionally, we see more emails that speak about the Enron event compared to the CEO. This suggests that they are more involved in the general management of the company as well as the Enron events than the CEO.

We do the same for the emails received:

status_email_subject_rec <- email_subject_rec %>%
  #keep only the workers whose status is known
  filter(!is.na(status_recipient)) %>%
  #compute the sum of each topic per year-month and status
  group_by(year_month, status_recipient) %>%
   mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one row per year-month and status
  distinct(year_month, status_recipient, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#pivot the data frame
status_email_subject_rec <- status_email_subject_rec %>%
  pivot_longer(
    cols = 3:length(status_email_subject_rec),
    names_to = "topic_email",
    values_to = "value")
#initiate the list to collect the plot
plot_list <- list()

#generating individual plot for each status
for(i in seq_along(status_list)){
  #assign the status to the variable
  status <- status_list[i]
  
  #the plot related to that status
  p <- status_email_subject_rec %>% filter(status_recipient == status) %>%
         ggplot(aes(year_month,value, color = topic_email))+
         geom_line(size = 1) +
      scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+    
         labs(color = "Email topics (subject & text)",
           title = paste0("Email received by ", status, ", subject and text analysis"),
           y = "Number of emails per topic",
           x = "Study period")+
         scale_color_manual(values = topic_colors,
                    labels = topic_label)+
        theme(legend.text.position = "bottom")
  
  #append the plot list
  plot_list[[i]] <- p
}


#display the plot created
n <- length(plot_list)

#number of plot per layout
plot_per_section <- 3

#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
  
  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of this layout in 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine the plots of this layout with one shared legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

When we look at the emails received, we can see that:

  • The pattern for the emails received looks the same as the one for the emails sent, suggesting most are email exchanges about the same subject. In the emails received for every status, we can see more emails that speak about the Enron event, suggesting that people in the company are aware of what happened. However, these emails might contain information on what happened or directions to follow in response to potential questions from the investigators.

  • The CEO received more emails than they sent. They especially received a significant number of emails about meetings. Perhaps due to their position, they are informed of all or most of the meetings conducted in the company. During the Enron event, they seemed to receive a large number of emails about the core business and processes. This might be to keep them informed about what was happening in the company.

This analysis of the email text and subjects highlights that the different statuses exchange information about what happens in the company, from the processes used for the business to the management of the investigation and the bankruptcy. The head of the company seems to be more informed than active in the email exchanges about the management of the Enron event. Both business parts of the company seem to be managed more by the President and Vice-president than by the CEO. The in-house lawyers are more active in the email exchanges during the SEC investigation and the bankruptcy, perhaps from a legal point of view.

As we did for all the workers in the company, we now look, per status, at which words from the investigated topics appear most often in the email subject or text. Here, we focus on the top 10 words found across both subject and text.

#Loop drawing the word cloud of the top 10 words found in the email subjects/texts sent by each status
for(i in seq_along(status_list)){
  
  status <- status_list[i]

df <- email_subject_send %>%
  #keep the emails sent by workers with the current status
  filter(status_sender == status) %>%
  #compute the sum of each topic per year-month and status
  group_by(year_month, status_sender) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0) | (sum_email_business_process != 0) | (sum_email_core_business != 0) | (sum_email_meeting != 0) | (sum_email_enron_event != 0)) %>%
  #keep one row per distinct email (status, subject, and text)
  distinct(status_sender, subject, reference)

#initiate the vectors storing the count of each word in the text and in the subject
email_words_freq <- c()
subject_freq <- c()

#loop over the words to count the number of times each one is found in the subjects and texts
for(j in seq_along(word_list)){
  
  word <- as.character(word_list[[j]])
  #count for the subject
  counting_subject <- sum(str_count(df$subject, word))
  
  subject_freq <- c(subject_freq, counting_subject)
  
  #count the email texts that contain the word: str_detect returns one value per row,
  #whereas str_locate returns a start/end matrix, so each matching row was counted twice
  nb <- sum(str_detect(df$reference, word), na.rm = TRUE)
  
  #store the frequency for each words in the email text
  email_words_freq <- c(email_words_freq, nb)
  
}

#for each status we make a total with the count from the subject and the text
total_count <- subject_freq + email_words_freq

#draw the wordcloud with the frequency of each word, only the top 10
wordcloud(word_list, total_count, min.freq = 10 ,max.words= 10,scale = c(3, 0.5) ,col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = paste0("Top 10 words in the email sent by ",status), col.main = "black", font.main = 2)


}

This last analysis for the email sent highlights that:

  • For all statuses, the top words are related to the meeting topics.

  • The employees and traders also speak about contracts, which we associate with the business process. Maybe this is because they are involved in this step of the Enron business.

  • The CEOs are the only status with more words related to business than meetings in their email subjects and texts. This suggests they send more emails about business compared to organizing meetings.
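
The word clouds rely on two kinds of counts: the occurrences of a word in the subjects, and the number of email texts that contain it. As a minimal sketch on made-up data, the two counts can be computed and combined as below. The vectors `subjects`, `texts`, and `words` and the two helper functions are hypothetical, and base R functions stand in for the stringr calls used in the analysis.

```r
# Toy data (hypothetical, for illustration only)
subjects <- c("meeting tomorrow", "contract review", "meeting about the meeting")
texts    <- c("please confirm the meeting", NA, "the contract is attached")
words    <- c("meeting", "contract", "gas")

# Number of occurrences of `word` across all subjects
count_occurrences <- function(x, word) {
  matches <- gregexpr(word, x, fixed = TRUE)
  # gregexpr returns -1 when there is no match, so only positions > 0 count
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

# Number of texts that contain `word` at least once (missing texts are skipped)
count_rows <- function(x, word) {
  sum(grepl(word, x, fixed = TRUE) & !is.na(x))
}

subject_freq <- vapply(words, function(w) count_occurrences(subjects, w), integer(1))
text_freq    <- vapply(words, function(w) count_rows(texts, w), integer(1))

# Total per word, as used for the word cloud frequencies
total_count <- subject_freq + text_freq
print(total_count)
```

Because the subject count is per occurrence and the text count is per email, a word repeated several times in one subject weighs more than a word repeated in one text.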

#Loop drawing the word cloud of the top 10 words found in the email subjects/texts received by each status
for(i in seq_along(status_list)){
  
  status <- status_list[i]

df <- email_subject_rec %>%
  #keep the emails received by workers with the current status
  filter(status_recipient == status) %>%
  #compute the sum of each topic per year-month and status
  group_by(year_month, status_recipient) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0) | (sum_email_business_process != 0) | (sum_email_core_business != 0) | (sum_email_meeting != 0) | (sum_email_enron_event != 0)) %>%
  #keep one row per distinct email (status, subject, and text)
  distinct(status_recipient, subject, reference)

#initiate the list for storing the count for each words in text and subject
email_words_freq <- c()
subject_freq <- c()

#loop over the words to count the number of times each one is found in the subjects and texts
for(j in seq_along(word_list)){
  
  word <- as.character(word_list[[j]])
  #count for the subject
  counting_subject <- sum(str_count(df$subject, word))
  
  subject_freq <- c(subject_freq, counting_subject)
  
  #count the email texts that contain the word: str_detect returns one value per row,
  #whereas str_locate returns a start/end matrix, so each matching row was counted twice
  nb <- sum(str_detect(df$reference, word), na.rm = TRUE)
  
  #store the frequency for each words in the email text
  email_words_freq <- c(email_words_freq, nb)
  
}

#for each status we make a total with the count from the subject and the text
total_count <- subject_freq + email_words_freq

#draw the wordcloud with the frequency of each word, only the top 10
wordcloud(word_list, total_count, min.freq = 10 ,max.words= 10,scale = c(3, 0.5), col=colorRampPalette(c("#cce5ff", "#3399ff", "#003366"))(length(word_list)), rot.per = 0.3)
title(main = paste0("Top 10 words in the email received by ",status), col.main = "black", font.main = 2)

}

In the emails received, the top 10 words fall into the same topic categories as those in the sent emails. For the CEO, we observe more words about meetings compared to business matters in their top 10. This suggests that the CEO is well-informed about the content of the meetings, such as reports on various topics, but tends to give directions for the core business processes. This is logical given their position.

This analysis highlights that in the emails where the subject and/or text contains the words we searched for, associated with specific topics, the top words are related to meetings. This makes sense when we see the peak of those topics for each status. We could infer that these email exchanges are related to meetings for managing the Enron event as well as the business aspects of the company.

The next step is to find the person in the company who is the most active in the email exchange. For that, we start by counting the number of emails sent by each worker and return the 10 people who sent the most.

#Display the top 10 email address of sender
p1 <- df_message_status %>% 
  #keep distinct exchange
  distinct(sender, subject, recipient, .keep_all = TRUE) %>%
  group_by(sender)%>% count() %>% #to count the number of email send per email address
  ungroup() %>%
  #calculate the percentage for each sender
  mutate(perc = round(`n`/sum(`n`),3),
  labels = scales::percent(perc)) %>% 
  arrange(desc(n)) %>% head(10) %>% #to get only the 10 email address with the most important number of email send
  #bar chart
  ggplot(aes(reorder(sender, perc, sum), perc, fill = sender)) +
  geom_bar(stat="identity") +
  coord_flip() +
  #graph title and label
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+  
  labs(title = "Top 10 Enron's employee email sender")+
  xlab("Employee's email address")+
  ylab("Email sent per sender (%)") +
  scale_fill_brewer(palette = "Set3")+
    theme(legend.position = "none",
        plot.margin = margin(10, 10, 10, 20))

#Display the top 10 email address of recipient
p2 <- df_message_status %>% filter(rtype == "TO") %>% #select only the email of the direct concerned receiver
  distinct(sender, recipient, subject, .keep_all = TRUE) %>%
  group_by(recipient)%>% count() %>% #to count the number of emails received per email address
  ungroup() %>%
  #calculate the percentage for each recipient
  mutate(perc = round(`n`/sum(`n`),4),
  labels = scales::percent(perc)) %>% 
  arrange(desc(n)) %>% head(10) %>% #to get only the 10 email addresses with the highest number of emails received
  #bar chart
  ggplot(aes(reorder(recipient, perc, sum), perc, fill = recipient)) +
  geom_bar(stat="identity") +
  coord_flip() +
  #graph title and label
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+ 
  labs(title = "Top 10 Enron's employee email receiver",
       subtitle = "Only principal receiver")+
  xlab("Employee's email address")+
  ylab("Email received per recipient (%)") +
  scale_fill_brewer(palette = "Set3")+
  theme(legend.position = "none",
        plot.margin = margin(10, 10, 10, 20))

#arrange the plot on the same place
p1 / p2

Jeff Dasovich seems to be the most active Enron worker in the email exchange: over the study period, he sent the highest proportion of emails (3.1%) and received the highest proportion (0.61%).
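
To make these percentages concrete: each bar is one address's share of all distinct exchanges, computed before the top 10 is kept. A minimal base R sketch on a hypothetical vector of sender addresses (not the Enron data) could look like this:

```r
# Hypothetical sender column: one entry per distinct exchange
sender <- c("a@x.com", "b@x.com", "a@x.com", "c@x.com", "a@x.com", "b@x.com")

# Count the exchanges per sender, then convert to a share of the total
counts <- table(sender)
share  <- round(prop.table(counts), 3)

# Keep the top 2 senders by share, largest first
top <- sort(share, decreasing = TRUE)[1:2]
print(top)
```

Since the share is computed before the top entries are selected, it stays relative to all exchanges and the displayed bars do not sum to 100%.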

#return only one result from that query to get the status of the most active sender/recipient
head(df_message_status[df_message_status$sender == "jeff.dasovich@enron.com", "status_sender"], 
     n=1)
## [1] Employee
## 10 Levels: CEO Director Employee In House Lawyer Manager ... Vice President

In the employee data set, he is described as an Employee of Enron. To see whether he is really the most active, we compare the number of emails he sent and received with those of the other workers with the same status (Employee) and with those of all Enron workers.

We first compare the number of emails sent by the worker who seems to be the most active (Jeff Dasovich) with those sent by all workers of his status (Employee) and by all Enron workers.

For that, we compute comparative descriptive statistics between them.

#count the number of email send by jeff dasovich per day
jeff_stat_send <- df_message_status %>% filter(sender == "jeff.dasovich@enron.com") %>%
  #we count the number of different email subject send per day
  group_by(date, subject) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Jeff Dasovich") %>% transform(source = as.factor(source))

#count the number of email send by all sender per day
sender_stat <- df_message_status %>% 
  #we count the number of different email subject send per day by each sender
  group_by(date, sender, subject) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "All sender") %>% select(-sender) %>% transform(source = as.factor(source))

#count the number of email send by Employee status per day
statuts_stat_send <- df_message_status %>% filter(status_sender == "Employee") %>% 
  #we count the number of different email subject send per day by each sender of status employee
  group_by(date, sender, subject) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Employee status") %>% transform(source = as.factor(source))

#combine the rows together to create a unique dataframe and compared the enron's worker and the employee to Jeff
violin_plot1 <- bind_rows(jeff_stat_send, statuts_stat_send)
violin_plot2 <- bind_rows(jeff_stat_send, sender_stat)

#compare the 2 groups with a t-test to see if Jeff Dasovich is really more active than the other employees
p3 <- ggplot(violin_plot1, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  #display the comparative statistic on the violin plot
  stat_compare_means(method = "t.test", label.y = max(violin_plot1$email_count) - 400)+
  labs(title = "Comparison of the emails sent by 
       Jeff Dasovich and by Enron's Employees",
       x = "Source",
       y = "Number of emails") +
  #to better see the violin plot we break the y axis
  scale_y_break(c(100, 3000), scales = 0.3)+
  #set up the color for each resources
  scale_fill_manual(values = c(
      "Jeff Dasovich" = "tomato2",
      "Employee status" = "yellowgreen"))+
  #withdraw the legend form the plot
  theme(legend.position = "none")

#same plot, but comparing Jeff Dasovich to all Enron workers
p4 <- ggplot(violin_plot2, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  stat_compare_means(method = "t.test", label.y = max(violin_plot2$email_count) - 2000)+
  scale_y_break(c(250, 15000), scales = 0.3)+
  labs(title = "Comparison of the emails sent by 
       Jeff Dasovich and by all senders",
       x = "Source",
       y = "Number of emails") +
  scale_fill_manual(#set up the color for each resources
    values = c(
      "Jeff Dasovich" = "tomato2",
      "All sender" = "cyan"))+
  theme(legend.position = "none")

#arrange the plot on the same place
p3 + p4

#display the stat of the different group
violin_plot <- bind_rows(jeff_stat_send, sender_stat, statuts_stat_send)

#Description of the email send by Jeff Dasovich, the Employee, and all
violin_plot %>% group_by(source)%>%
  summarise(
    mean = mean(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 3 × 7
##   source           mean    sd   min    Q1    Q3   max
##   <fct>           <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 Jeff Dasovich   15.6   45.7     1     1     9   760
## 2 All sender      10.6   80.6     1     1     5 18445
## 3 Employee status  5.49  29.9     1     1     3  3556

The table summarizing the emails sent by each group shows that:

  • Jeff Dasovich has the highest average count (15.6, computed per day and subject, as grouped in the code above). The lowest is for the Enron employees (5.49).

  • Looking at the first and third quartiles (the 25th and 75th percentiles), Jeff also has the highest third quartile (9), especially compared to the Enron employees (3).

  • Surprisingly, the highest single count (18,445) is found among all senders. This may be linked to the Enron event.

From this we can deduce that Jeff Dasovich is significantly the most active Enron worker in sending emails.
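
The p-value displayed on the violin plots comes from a t-test on the per-group counts. As a hedged sketch of what such a comparison involves, the base R code below runs `t.test()` with its defaults (a Welch test, which does not assume equal variances) on two small made-up samples; `jeff` and `others` are hypothetical values, not taken from the data set. Because email counts are strongly right-skewed, a rank-based Wilcoxon test is shown alongside as a possible cross-check, not as the method used in this study.

```r
# Hypothetical counts (illustration only, not the Enron data)
jeff   <- c(1, 2, 9, 12, 15, 20, 45)
others <- c(1, 1, 1, 2, 3, 3, 5, 120)

# Welch two-sample t-test (the default of t.test: unequal variances)
tt <- t.test(jeff, others)

# Rank-based alternative, less sensitive to extreme values;
# exact = FALSE avoids a warning about ties in the toy data
wt <- wilcox.test(jeff, others, exact = FALSE)

print(c(t_test = tt$p.value, wilcoxon = wt$p.value))
```

When the two tests disagree, it is usually a sign that a few extreme values drive the difference in means.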

Then we look at the emails received by Jeff Dasovich compared to those received by Enron workers of the same status and by all Enron workers.

#statistics on the jeff dasovich email receive per day
jeff_stat_rec <- df_message_status %>% filter(recipient == "jeff.dasovich@enron.com") %>%
  group_by(date) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Jeff Dasovich") %>% transform(source = as.factor(source))

#statistics on the emails received per day by each recipient
recipient_stat <- df_message_status %>% group_by(date, recipient) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Enron's worker") %>% select(-recipient) %>% transform(source = as.factor(source))

#statistics on the emails received per day by the workers with the Employee status (grouped by date only, so each count is a daily total across employees)
statuts_stat_rec <- df_message_status %>% filter(status_recipient == "Employee") %>% group_by(date) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Employee status") %>% transform(source = as.factor(source))

#combine the rows together to create a unique dataframe and compared the enron's worker and the employee to Jeff
violin_plot1 <- bind_rows(jeff_stat_rec, statuts_stat_rec)
violin_plot2 <- bind_rows(jeff_stat_rec, recipient_stat)

#compare the 2 groups with a t-test to see if Jeff Dasovich really receives more emails than the other employees and/or workers of Enron
p3 <- ggplot(violin_plot1, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  #compared statisticaly the 2 group to see if the difference is significant or not
  stat_compare_means(method = "t.test", label.y = max(violin_plot1$email_count) + 2)+
  labs(title = "Comparison of the emails received by 
       Jeff Dasovich and by Enron's Employees",
       x = "Source",
       y = "Number of emails") +
  theme(legend.position = "none")+
  scale_fill_manual(#set up the color for each resources
    values = c(
      "Jeff Dasovich" = "tomato2",
      "Employee status" = "yellowgreen"
    ))

p4 <- ggplot(violin_plot2, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  ylim(c(-10,350))+
  stat_compare_means(method = "t.test", label.y = 300)+
  labs(title = "Comparison of the emails received by 
       Jeff Dasovich and by all recipients",
       x = "Source",
       y = "Number of emails") +
  theme(legend.position = "none")+
  scale_fill_manual(#set up the color for each source; the names must match the values of `source`
    values = c(
      "Jeff Dasovich" = "tomato2",
      "Enron's worker" = "cyan"
    ))

#arrange the plot on the same place
p3 + p4

violin_plot <- bind_rows(jeff_stat_rec, recipient_stat, statuts_stat_rec)

#Description of the email received 
violin_plot %>% group_by(source) %>%
  summarise(
    mean = mean(email_count),
    median = median(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 3 × 8
##   source           mean median     sd   min    Q1    Q3   max
##   <fct>           <dbl>  <dbl>  <dbl> <int> <dbl> <dbl> <int>
## 1 Jeff Dasovich   17.5      10  19.2      1     3   25    113
## 2 Enron's worker   3.19      2   6.36     1     1    3   1153
## 3 Employee status 98.6      40 156.       1     7  122.  1333

When we look at the number of emails received, Jeff Dasovich received significantly more emails per day on average (17.5) than Enron workers overall (3.19). However, compared to the Employee status group, he did not receive more; on the contrary, he received significantly fewer (17.5 against 98.6). Note that the Employee status counts are grouped by date only, so they are daily totals across all employees rather than per-person counts, which limits this comparison. For the Employee status group, the mean (98.6) is far from the median (40), suggesting that extreme values exist within that group. The violin plot for the employees highlights this: above the 3rd quartile, there is a long tail starting around 120 and becoming extremely thin after 250. Conversely, in Jeff Dasovich’s violin plot, the tail above the 3rd quartile does not become finer but seems to consistently hold a significant number of observations with these high values. All of this suggests that some events caused the employees to receive an extremely high number of emails, a peak that is not seen for Jeff Dasovich.

From this part of the analysis we can say that:

- Jeff Dasovich is the Enron worker who sent and received the highest number of emails.

- Compared to other workers with the Employee status, he sent significantly more emails but received fewer.

- It is possible that some events caused employees other than Jeff Dasovich to receive more emails in one day. We could think that Jeff Dasovich is one of the employees who receive the most emails per day, but not the only one.

We can conclude that Jeff Dasovich is more active than passive in the email exchange and is the sender with the highest number of emails per day during the study period.

#global environment cleaning
rm(grid_plot, i, j, n, no_legend, p, p3, p4, plot_list, plot_on_the_page, plot_per_section, plots_with_legend, status, status_list,
   status_email_subject, adjacencyData_99, adjacencyData_00, adjacencyData_01, adjacencyData_02, word, word_count, nb, legend, email_words_freq, counting, search, df, total_count, email_words_freq, subject_freq)

In addition, the Wikipedia page on the Enron scandal gives a list of the people involved in it. We will search for them in the data set to analyze the subjects of the emails they sent and to see whether they played a role in the Enron scandal. Source: Wikipedia page about the timeline of Enron’s downfall.

We find:

  • Kenneth Lay: he was the founder, chief executive officer, and chairman of Enron and was heavily involved in Enron’s scandal.

  • Jeffrey Skilling: he was the CEO of the company during the scandal and deeply involved in the fraud.

  • Andrew Fastow: he was the chief financial officer and was fired shortly before the bankruptcy.

  • Lea Fastow: she was an assistant treasurer at Enron and the wife of Andrew Fastow.

  • Timothy Belden: he was the head of trading at Enron.

  • Vincent Kaminski: he worked at Enron as the head of the quantitative modelling group.

  • Jordan Mintz: he is a former managing director for corporate tax at Enron.

  • Sherron Watkins: she was one of the vice-presidents of Enron.

  • Richard Causey: he was the chief accounting officer of Enron.

  • Greg Whalley: he was an Enron executive.

To this list we add Jeff Dasovich, who does not appear on the Wikipedia page but whom we found to be the most active employee in sending emails. He may have taken part in exchanges related to the Enron events.

#to find the person involved in the fiscal fraud we use str_detect to see if we can find them in the data set
#for example here for Vincent Kaminski
people_of_interest <- df_message_status %>% filter(str_detect(sender,"kaminski"))
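One detail of this search is worth a small sketch: `str_detect` interprets its pattern as a regular expression, so a dot matches any character. The addresses below are invented for the illustration; `fixed()` and `regex(ignore_case = TRUE)` are standard stringr helpers.

```r
library(stringr)

# Hypothetical sender addresses (illustration only)
sender <- c("vkaminski@enron.com", "jeff.dasovich@enron.com",
            "jeffXdasovich@example.com", "J.Kaminski@ENRON.COM")

# In a regular expression "." matches any character,
# so this pattern also matches "jeffXdasovich"
loose <- str_detect(sender, "jeff.dasovich")

# fixed() treats the pattern as a literal string
strict <- str_detect(sender, fixed("jeff.dasovich"))

# regex(ignore_case = TRUE) also catches addresses written in upper case
any_case <- str_detect(sender, regex("kaminski", ignore_case = TRUE))
```

On real addresses the loose pattern rarely causes false matches, but the literal form makes the intent explicit.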

First we construct the data sets of the emails sent and received by each Enron worker known for being involved in the fraud.

#emails sent:
person_of_interest_send <- email_subject_send %>%
  #the pattern below has no "kenneth.lay" entry, so the "Kenneth Lay" label is never assigned
  filter(str_detect(sender,"jeff.dasovich|andrew.baker|tim.belden|andrew.fastow|lfastow|vkaminski|jordan.mintz|jeff.skilling|sherron.watkins|richard.causey|greg.whalley")) %>%
  mutate(
    #identify the person who sent the email
    email_label_sender = case_when(
      sender == "jeff.dasovich@enron.com" ~ "Jeff Dasovich",
      sender == "kenneth.lay@enron.com" ~ "Kenneth Lay",
      sender == "jeff.skilling@enron.com" ~ "Jeffrey Skilling",
      sender == "andrew.baker@enron.com" ~ "Andrew Baker",
      sender == "tim.belden@enron.com" ~ "Timothy Belden", 
      sender %in% c("lfastow@pop.pdq.net", "lfastow@pdq.net") ~ "Lea Fastow",
      sender == "andrew.fastow@enron.com" ~ "Andrew Fastow",
      sender %in% c("vkaminski@enron.com", "vkaminski@aol.com", "vkaminski@palm.net") ~ "Vincent Kaminski",
      sender == "jordan.mintz@enron.com" ~ "Jordan Mintz",
      sender == "sherron.watkins@enron.com" ~ "Sherron Watkins",
      sender == "richard.causey@enron.com" ~ "Richard Causey", 
      sender == "greg.whalley@enron.com" ~ "Greg Whalley", 
      .default = sender))

#emails received
person_of_interest_reciveid <- email_subject_rec %>%
  #the pattern below has no "kenneth.lay" entry, so the "Kenneth Lay" label is never assigned
  filter(str_detect(recipient,"jeff.dasovich|andrew.baker|tim.belden|andrew.fastow|lfastow|vkaminski|jordan.mintz|jeff.skilling|sherron.watkins|richard.causey|greg.whalley")) %>%
  mutate(
    #identify the person who received the email
    email_label_recipient = 
      case_when(
        recipient %in% c("jeff.dasovich@enron.com","jeff_dasovich@ees.enron.com") ~ "Jeff Dasovich",
        recipient == "kenneth.lay@enron.com" ~ "Kenneth Lay",
        recipient %in% c("jeff.skilling@enron.com","jeff_skilling@enron.com") ~ "Jeffrey Skilling",
        recipient == "andrew.baker@enron.com" ~ "Andrew Baker",
        recipient %in% c("tim.belden@enron.com", "tim_belden@pgn.com") ~ "Timothy Belden",
        recipient %in% c("lfastow@pop.pdq.net", "lfastow@pdq.net") ~ "Lea Fastow",
        recipient %in% c("andrew.fastow@enron.com", "andrew.fastow@ljminvestments.com") ~ "Andrew Fastow",
        recipient %in% c("vkaminski@enron.com", "vkaminski@aol.com","vkaminski@aol .com", "vkaminski@palm.net",
                         "vkaminski@ol.com", "vkaminski@aol .com", "vkaminski@aol .com") ~ "Vincent Kaminski",
        recipient %in% c("jordan.mintz@enron.com","jordan_mintz@enron.com") ~ "Jordan Mintz",
        recipient == "sherron.watkins@enron.com" ~ "Sherron Watkins",
        recipient == "richard.causey@enron.com" ~ "Richard Causey", 
        recipient == "greg.whalley@enron.com" ~ "Greg Whalley", 
        .default = recipient)) 
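As the list of aliases grows, the same labelling can be expressed with a lookup table instead of a long `case_when`. The sketch below is only an alternative under assumptions: `address_book` and `emails` are hypothetical toy tables, not objects of this report.

```r
library(dplyr)

# Hypothetical lookup table: one row per known address (aliases share a label)
address_book <- tibble(
  address = c("vkaminski@enron.com", "vkaminski@aol.com", "tim.belden@enron.com"),
  label   = c("Vincent Kaminski", "Vincent Kaminski", "Timothy Belden")
)

# Toy data standing in for email_subject_send
emails <- tibble(sender = c("vkaminski@aol.com", "tim.belden@enron.com",
                            "someone@else.org"))

labelled <- emails %>%
  left_join(address_book, by = c("sender" = "address")) %>%
  # unknown addresses keep the raw address, like .default = sender
  mutate(email_label_sender = coalesce(label, sender)) %>%
  select(-label)
```

With this layout, adding an alias means adding a row to the table rather than editing the code in two places (sender and recipient).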

We look at the number of emails sent and received by each person studied.

Emails sent each month:

#create a list with the name of each person we want to study
enron_worker_send <- unique(person_of_interest_send$email_label_sender)

#initiate the list to store the plots
worker_send_plot <- list()

#loop building, for each person studied, a bar plot of the number of emails sent per month
for(i in seq_along(enron_worker_send)){

  worker <- enron_worker_send[i]

  p <- person_of_interest_send %>%
    filter(email_label_sender == worker) %>%
    group_by(year, month) %>%
    count() %>%
    #bar plot
    ggplot(aes(month, n, fill = month)) +
    geom_bar(stat = "identity") +
    facet_grid(~year) +
    labs(title = paste("Email sent per month for each year by", worker),
         y = "Number of emails") +
    scale_fill_manual(
      values = month_color,
      labels = month_label) +
    theme(legend.position = "bottom",
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.x = element_blank())

  worker_send_plot[[i]] <- p
}

worker_send_plot
## [11 bar plots, one per person: "Email sent per month for each year by <person>"; bars per month, faceted by year; y axis: Number of emails]

Emails received each month:

#list of the persons studied
enron_worker_rec <- unique(person_of_interest_reciveid$email_label_recipient)

#initiate the list to store the plots
worker_rec_plot <- list()

#loop building, for each person studied, a bar plot of the number of emails received per month
for(i in seq_along(enron_worker_rec)){

  worker <- enron_worker_rec[i]

  p <- person_of_interest_reciveid %>%
    filter(email_label_recipient == worker) %>%
    group_by(year, month) %>%
    count() %>%
    #bar plot
    ggplot(aes(month, n, fill = month)) +
    geom_bar(stat = "identity") +
    facet_grid(~year) +
    labs(title = paste("Email received per month for each year by", worker),
         y = "Number of emails") +
    scale_fill_manual(
      values = month_color,
      labels = month_label) +
    theme(legend.position = "bottom",
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.x = element_blank())

  worker_rec_plot[[i]] <- p
}

worker_rec_plot
## [11 bar plots, one per person: "Email received per month for each year by <person>"; bars per month, faceted by year; y axis: Number of emails]

When we look at the number of emails received and sent by the Enron workers known for being involved in the Enron events, we can see they sent fewer emails than they received. Moreover, the pattern of each follows the general pattern of the workers in the Enron company. For all of them, the emails were sent or received principally in 2001, suggesting they were active in the email exchange during the SEC investigation as well as the bankruptcy. By adding Jeff Dasovich, whom we identified earlier as the most active sender, we see that he is the most active in this group of people working at Enron. The least active senders in this group are Sherron Watkins, Andrew Baker, Andrew Fastow, and Lea Fastow.

Then we look at the number of emails sent about the topics and keywords we identified.

#keep the workers of interest and compute the number of emails they sent
person_of_interest_send_subject <- person_of_interest_send %>%
  #compute, per month, the number of emails sent on each topic by the persons directly involved in the Enron scandal
  group_by(year_month, email_label_sender) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one line per year and month
  distinct(year_month, email_label_sender, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)


#pivot the table
person_of_interest_send_subject <-person_of_interest_send_subject %>%
  pivot_longer(
  cols = 3:length(person_of_interest_send_subject),
  names_to = "topic_email",
  values_to = "value"
)
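The monthly totals followed by the pivot can also be written with `summarise(across(...))`, which returns one row per group directly and avoids the `distinct()` step. This is a minimal sketch on invented data: the column names mirror the report, the values do not. Unlike the code above, it applies `na.rm = TRUE` to every column.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per email with 0/1 topic flags (invented values)
toy <- tibble(
  year_month = as.Date(c("2001-10-01", "2001-10-01", "2001-11-01")),
  email_label_sender = c("A", "A", "A"),
  subject_meeting = c(1, 0, 1),
  email_meeting = c(1, NA, 0)
)

monthly_long <- toy %>%
  group_by(year_month, email_label_sender) %>%
  # one row per month and sender, one sum_* column per topic flag
  summarise(across(c(subject_meeting, email_meeting),
                   ~ sum(.x, na.rm = TRUE), .names = "sum_{.col}"),
            .groups = "drop") %>%
  # long format: one row per month, sender and topic
  pivot_longer(cols = starts_with("sum_"),
               names_to = "topic_email", values_to = "value")
```

Selecting the columns with `starts_with("sum_")` also removes the dependency on column positions used in `cols = 3:length(...)`.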

For each Enron worker known for being involved in the different Enron events, we create a line plot of the number of emails per topic to follow the evolution of the topics discussed over the study period.

#initiate the list to collect the plot
plot_list <- list()

#generating an individual plot for each worker
for(i in seq(enron_worker_send)){
  #assign the worker to the variable
  worker <- enron_worker_send[i]
  
  #the plot related to that worker
  p <- person_of_interest_send_subject %>% filter(email_label_sender == worker) %>% 
    ggplot(aes(year_month,value, color = topic_email))+
         geom_line(linewidth = 1) +
         labs(color = "Email topics (subject & text)",
           title = paste("Email sent by", worker, "subject and text analysis"),
           y = "Number of emails per topic",
           x = "Study period")+
     scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+ 
         scale_color_manual(values = topic_colors,
                    labels = topic_label)+
        theme(legend.text.position = "bottom")
  
  #append the plot list
  plot_list[[i]] <- p
}


#display the plot created
n <- length(plot_list)

#number of plot per layout
plot_per_section <- 3

#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
  
  #select the plots shown on this layout (at most plot_per_section)
  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plots on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the layout on 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine the plots and the shared legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}
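The paging logic of this loop can be checked on its own, without any plot. The sketch below uses placeholder strings in place of `plot_list`; `items` and `per_page` are hypothetical names chosen for the illustration.

```r
# Stand-in for plot_list: any list works, here 7 labelled placeholders
items <- as.list(paste0("plot_", 1:7))
per_page <- 3

# Page number of each element: 1 1 1 2 2 2 3
page_id <- ceiling(seq_along(items) / per_page)

# One list element per page; the last page holds the remainder
pages <- split(items, page_id)

lengths(pages)
```

Each element of `pages` could then be passed to the legend extraction and `arrangeGrob` steps, which keeps the index arithmetic out of the drawing code.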

We can see that:

  • For most of the workers studied here, the emails about those topics were sent principally in 2001, which corresponds to the period of the investigation and the bankruptcy.

  • Jeff Dasovich is the most active Enron worker in this shortlist for sending emails. He sends emails about all topics, especially meetings and various business aspects. His highest peak in sent emails runs from October 2001 to January 2002, the period when the company was under SEC investigation and the bankruptcy process started. We could think he holds a high level of responsibility within the company. But even at this time, the number of emails about the Enron events is lower than the number about the business process. He may be more involved in the company's business than in managing the investigation and/or the bankruptcy.

  • The other workers are known to be involved in the events, but they send few emails about these topics (no more than 15). This could be because the email text data are not exhaustive and many of their emails about these topics are not publicly available. Most of the emails that match these topics appear to have been sent in 2001.

  • All of them send emails about meetings, core business, and Enron events. Surprisingly, we find few associated with the core business at Enron. Perhaps these individuals are more active in the business processes than in the regular affairs of the company.

Next we look at the number of emails received about those topics.

#keep the workers of interest and compute the number of emails they received
person_of_interest_reciveid_subject <- person_of_interest_reciveid %>%
  #compute, per month, the number of emails received on each topic by the persons directly involved in the Enron scandal
  group_by(year_month, email_label_recipient) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one line per year and month
  distinct(year_month, email_label_recipient, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)

#pivot the table
person_of_interest_reciveid_subject <-person_of_interest_reciveid_subject %>%
  pivot_longer(
  cols = 3:length(person_of_interest_reciveid_subject),
  names_to = "topic_email",
  values_to = "value"
)

Display of the emails received about those topics for each Enron worker known to be involved in the Enron events:

#initiate the list to collect the plot
plot_list <- list()

#generating an individual plot for each worker
for(i in seq(enron_worker_rec)){
  #assign the worker to the variable
  worker <- enron_worker_rec[i]
  
  #the plot related to that worker
  p <- person_of_interest_reciveid_subject %>% filter(email_label_recipient == worker)%>% 
    ggplot(aes(year_month,value, color = topic_email))+
         geom_line(linewidth = 1) +
    scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
         labs(color = "Email topics (subject & text)",
           title = paste("Email received by", worker, "subject and text analysis"),
           y = "Number of emails per topic",
           x= "Study period")+
         scale_color_manual(values = topic_colors,
                    labels = topic_label)+
        theme(legend.text.position = "bottom")
  
  #append the plot list
  plot_list[[i]] <- p
}


#display the plot created
n <- length(plot_list)

#number of plot per layout
plot_per_section <- 3

#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
  
  #select the plots shown on this layout (at most plot_per_section)
  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plots on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the layout on 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine the plots and the shared legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

We can observe that:

  • As for the emails sent, those topics appear in the emails received mostly in 2001, suggesting they are related to the events.

  • All of them received more emails about the Enron events and both business topics than they sent, which shows they are more informed than active in the email exchange about those topics. This is true for everyone except Jeff Dasovich, who received and sent a similar number of emails related to those topics.

  • Timothy Belden and Vincent Kaminski, after the meeting topic, received more emails about the business process compared to other topics. This may be due to their roles in the company and suggests they are the most informed in this group about the business process.

From this analysis, we can deduce that Jeff Dasovich is highly active in the email exchanges on all the topics investigated here. The other persons for whom we looked at the email subject and content seem to be more passive than active in the email exchange. In fact, they send few emails about those topics compared to the number they receive. In the emails received, an important part concerns the business process as well as meetings. This suggests that these persons are aware of how the company manages its business and may participate in meetings about it. Moreover, because the peak for all of them in sending and receiving emails runs from October 2001 to January 2002, we can think that they are involved in and informed about the investigation and/or the bankruptcy management at various levels.

The external exchange

When we started exploring the data set, we noted that on average 1% of the email exchanges have neither a sender nor a recipient with an Enron email address. These people are potentially external to the company and could discuss the events. We can imagine that external people involved in internal email exchanges could relay what Enron workers do inside the company to other external people. In this part we explore this hypothesis.

#extract the email exchanges that do not involve any enron email address
extern_email <- df_message_status %>% select(date, year, month, sender, recipient, subject, reference) %>% 
    #flag the senders and recipients that have an enron email address
    mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
  count_recipient = if_else(str_detect(recipient, "@enron"), 1,0)) %>% 
    #for each date and subject, sum the flagged senders and recipients
    group_by(date, subject) %>% mutate(
      sum_sender = sum(count_sender),
      sum_recipient = sum(count_recipient)) %>% ungroup() %>%
    #keep the exchanges in which nobody has an enron email address
    filter((sum_sender ==0) & (sum_recipient == 0)) %>% select(-c(count_sender, count_recipient, sum_sender, sum_recipient)) %>%
  #convert the sender and recipient strings to factors
  transform(sender = as.factor(sender),
            recipient = as.factor(recipient))
summary(extern_email)
##       date              year           month     
##  Min.   :1999-09-19   1999:  870   10     :4347  
##  1st Qu.:2000-12-03   2000: 6879   11     :4209  
##  Median :2001-05-25   2001:15653   12     :3818  
##  Mean   :2001-05-10   2002: 1810   09     :2620  
##  3rd Qu.:2001-10-26                05     :1811  
##  Max.   :2002-12-21                04     :1696  
##                                    (Other):6711  
##                                 sender     
##  owner-eveningmba@haas.berkeley.edu:  910  
##  naftcorp@aol.com                  :  897  
##  jbennett@gmssr.com                :  889  
##  berk@haas.berkeley.edu            :  871  
##  duggar@haas.berkeley.edu          :  761  
##  feedback@intcx.com                :  611  
##  (Other)                           :20273  
##                         recipient    
##  Undisclosed-Recipient       :  838  
##  eveningmba@haas.berkeley.edu:  431  
##  soblander@carrfut.com       :  372  
##  tie_list_server@nyiso.com   :  283  
##  marketing@nymex.com         :  275  
##  linguaphile@wordsmith.org   :  265  
##  (Other)                     :22748  
##                                                                                             subject     
##  Quantitative Finance Update from FinMath.com @ Chicago                                         :  897  
##  NYS Reliability Council Executive Committee                                                    :  515  
##  Brief of Enron Energy Service Inc. on Rate Design -- A. 00-11-038                              :  445  
##  looking for key players to form a founding team of startup                                     :  298  
##  Comments of Enron Energy Services on Proposed and Alternate Decis\tions -- A. 00-11-038, et al.:  230  
##  Errata To the Rate Design Testimony of Enron Energy Services Inc.                              :  214  
##  (Other)                                                                                        :22613  
##   reference        
##  Length:25212      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
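The filtering rule used to build `extern_email` can be sketched on a toy table: a (date, subject) group is kept only if none of its rows involves an `@enron` address. The addresses and subjects below are invented, and `msgs` is a hypothetical stand-in for `df_message_status`.

```r
library(dplyr)
library(stringr)

# Toy exchange table (addresses and subjects are invented)
msgs <- tibble(
  date      = as.Date(c("2001-10-01", "2001-10-01", "2001-10-02")),
  subject   = c("Rate design", "Rate design", "Lunch"),
  sender    = c("a@law.com", "a@law.com", "b@uni.edu"),
  recipient = c("c@firm.com", "d@enron.com", "e@uni.edu")
)

external_only <- msgs %>%
  group_by(date, subject) %>%
  # keep a group only if no row in it involves an @enron address
  filter(!any(str_detect(sender, "@enron") | str_detect(recipient, "@enron"))) %>%
  ungroup()
```

Here the "Rate design" group is dropped entirely because one of its recipients has an Enron address, even though its other row is purely external.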

By looking at the data summary we can see that:

  • those emails seem to have been sent mostly in 2001: the median date is 2001-05-25, the 3rd quartile is 2001-10-26, and 2001 is the year with the most rows (15653).

  • the sender address that appears the most belongs to a UC Berkeley domain (haas.berkeley.edu). For the recipients, the top entry is "Undisclosed-Recipient", so the address of the top recipient is unknown.

  • among the top subjects, three mention Enron (Enron Energy Services).

This suggests investigating these email exchanges further to see whether they discuss the Enron events. For that we use the same topics and keywords as in the main table.

extern_email_graph <- extern_email %>% distinct(date, year, month, sender, recipient, subject, reference) %>% 
  #filter for the email having in their subject enron
  filter(str_detect(subject, "enron|Enron") | str_detect(reference, "enron|Enron")) %>%
   mutate(#count the number of email which contain at least one word in the list of each topic
    subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
    subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
    subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
    subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
    email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
    email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
    email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
    email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
    #to get the date in year/month
    year_month = as.Date(paste0(year,"-",month,"-01"))) %>%
  group_by(year_month) %>%
   mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one line per year and month
  distinct(year_month, subject, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)

#graph of the emails mentioning enron that could be about the enron events/business process  
extern_email_graph %>% select(-subject) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:9,
  names_to = "topics",
  values_to = "value") %>%
  #line plot, one line per topic
  ggplot(aes(year_month,value, color=topics))+
  geom_line(linewidth = 1)+
  #label, axis, and legend
  labs(color = "Email topics (subject & text)",
    title = "Email subject and text about enron event",
    subtitle = "Email exchange about Enron between person whose haven't an enron email address",
       x = "Study period",
       y = "Number of emails per topic") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors,
    labels = topic_label)

The graph above shows that some emails between people without an Enron email address discuss the Enron events, especially the business process, and fewer discuss the core business of the company. Those emails were mostly sent between October 2001 and January 2002, which is the period of the Enron fraud investigation by the SEC. Most of the time, these topics are found in the email text rather than in the subject. Because of the period, we could think they discuss the bankruptcy more than the investigation itself.
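The two building blocks behind this graph, the topic flag and the monthly date, can be sketched on a few invented subjects. The keyword list `topic_example` is hypothetical: the report's own `topic_enron_event` pattern is defined earlier and is not reproduced here.

```r
library(stringr)

# Hypothetical keyword list, collapsed into one regular expression
topic_example <- paste(c("bankruptcy", "investigation", "SEC"), collapse = "|")

subject <- c("NYTimes.com Article: Enron bankruptcy filing",
             "Lunch on Friday", NA)

# 1 when at least one keyword appears, 0 otherwise; NA subjects stay NA
flag <- as.integer(str_detect(subject, topic_example))

# NA flags are ignored in the monthly total, as with na.rm = TRUE in the report
monthly_total <- sum(flag, na.rm = TRUE)

# Month used on the x axis: first day of the month, built from year and month
year_month <- as.Date(paste0("2001", "-", "12", "-01"))
```

Missing text stays `NA` after `str_detect`, which is why the sums on the email text need `na.rm = TRUE`.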

#isolate the subject about enron and their event
Enron_subject <- extern_email_graph %>% 
  filter(str_detect(subject, "enron|Enron")) %>% 
  filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0)) %>% distinct(year_month, subject, .keep_all = TRUE)

#keep only the exchanges that involve at least one enron email address
no_extern <- df_message_status %>% select(date, sender, recipient, subject, reference) %>% 
    #flag the senders and recipients that have an enron email address
    mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
  count_recipient = if_else(str_detect(recipient, "@enron"), 1,0)) %>% 
    #for each date and subject, sum the flagged senders and recipients
    group_by(date, subject) %>% mutate(
      sum_sender = sum(count_sender),
      sum_recipient = sum(count_recipient)) %>% ungroup() %>%
    #keep the exchanges in which at least one person has an enron email address
    filter((sum_sender !=0) | (sum_recipient != 0)) %>% select(-c(count_sender, count_recipient, sum_sender, sum_recipient)) %>%
  #convert the sender and recipient strings to factors
  transform(sender = as.factor(sender),
            recipient = as.factor(recipient))

#inner join with the external subjects to see if they also appear in exchanges involving enron addresses
print(verify <- inner_join(no_extern, Enron_subject, by = "subject"))
##         date                  sender                recipient
## 1 2002-01-04 david.forster@enron.com louise.kitchen@enron.com
## 2 2001-12-07        louise@enron.com         louise@enron.com
##                                                                            subject
## 1                                                            EnronOnline Documents
## 2 NYTimes.com Article: Enron Paid Out  Retention  Bonuses Before Bankruptcy Filing
##   reference year_month sum_subject_meeting sum_subject_business_process
## 1      <NA> 2001-12-01                   0                            0
## 2      <NA> 2001-12-01                   0                            0
##   sum_subject_core_business sum_subject_enron_event sum_email_business_process
## 1                         1                       1                          1
## 2                         1                       1                          1
##   sum_email_core_business sum_email_meeting sum_email_enron_event
## 1                       0                 1                     1
## 2                       0                 1                     1

We can see that 2 subjects are found both in the external exchanges and in the data set restricted to exchanges involving people with an Enron email address. Those emails were sent in December 2001 and January 2002: one is from the CEO David Forster and is about EnronOnline documents; the second is from a Louise at Enron and relates to an article about the bankruptcy of Enron. We can think that these emails also involved people external to the company, who spread this information outside it.
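The matching step relies on a join by subject. The sketch below, on invented rows, contrasts `inner_join`, used above, with `semi_join`, which answers the same question without adding the columns of the second table. `internal` and `external` are hypothetical stand-ins for `no_extern` and `Enron_subject`.

```r
library(dplyr)

# Toy tables (invented rows)
internal <- tibble(subject = c("EnronOnline Documents", "Weekly report"),
                   sender  = c("x@enron.com", "y@enron.com"))
external <- tibble(subject = c("EnronOnline Documents", "Startup team"),
                   year_month = as.Date(c("2001-12-01", "2001-11-01")))

# inner_join keeps the matching rows and adds the columns of both tables
both <- inner_join(internal, external, by = "subject")

# semi_join keeps the internal rows that have a match, without extra columns
matched_internal <- semi_join(internal, external, by = "subject")
```

Because the match is on the exact subject string, replies or forwards with a modified subject line would not be found by either join.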

To conclude on the project, we can say that: The Enron company is composed of different statuses which seem to have varying degrees of involvement in the fiscal fraud. The person at the head of the company, as well as the traders and the lawyers, seem to be active participants in the fraud. The other statuses seem to be more aware of it, perhaps not playing a significant role in it. By looking at the people known to be involved in the Enron fiscal fraud, we do not identify many emails sent or received about it, nor about the management of the bankruptcy or the SEC investigation. We can assume they used other means of communication. Given the time, they might have communicated more by phone than email. A brief investigation about potential external exchanges shows that other companies in the US spoke about the Enron event and two emails are directly associated with company internal exchanges. It could be interesting to investigate the email content further by having a more exhaustive dataset about them. This will enhance the knowledge of the Enron event as well as the implication of the different statuses in them.